RAG Model Documentation Demo

In this notebook, we implement a simple RAG model that automates answering RFP questions using GenAI. We will initialize an embedding model, a retrieval model, and a generator model with LangChain components and use them within the ValidMind Library to run tests against each. Finally, we will combine them into a pipeline, run it to get end-to-end results, and run tests against the pipeline as a whole.

About ValidMind

ValidMind is a platform for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind AI Risk Platform UI to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.

Before you begin

This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with pip install, and then re-run the notebook. For more help, refer to Installing Python Modules.

New to ValidMind?

If you haven’t already seen our Get started with the ValidMind Library guide, we recommend you explore the available resources for developers at some point. There, you can learn more about documenting models, find code samples, or read our developer reference.

For access to all features available in this notebook, create a free ValidMind account.

Signing up is FREE — Register with ValidMind

Key concepts

  • FunctionModels: ValidMind offers support for creating VMModel instances from Python functions. This enables us to support any “model” by simply using the provided function as the model’s predict method.
  • PipelineModels: ValidMind models (VMModel instances) of any type can be piped together to create a model pipeline. This allows model components to be created and tested/documented independently, and then combined into a single model for end-to-end testing and documentation. We use the | operator to pipe models together. A minimal sketch follows this list.
  • RAG: RAG stands for Retrieval Augmented Generation and refers to a wide range of GenAI applications where some form of retrieval is used to add context to the prompt so that the LLM that generates content can refer to it when creating its output. In this notebook, we are going to implement a simple RAG setup using LangChain components.
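
To make these concepts concrete before we build the real components below, here is a minimal sketch of the pattern we will follow. The function names and the toy logic are illustrative only; the actual models we initialize later in this notebook follow the same pattern.

import validmind as vm

# Any Python function can act as a model: the function becomes the model's predict method
def upper_case(input):
    return input["question"].upper()

def add_prefix(input):
    # Downstream models in a pipeline see upstream outputs keyed by the upstream model's input_id
    return "Answer: " + input["upper_case_model"]

vm_upper = vm.init_model(input_id="upper_case_model", predict_fn=upper_case)
vm_prefix = vm.init_model(input_id="prefix_model", predict_fn=add_prefix)

# The | operator chains VMModel instances into a single PipelineModel
vm_toy_pipeline = vm.init_model(vm_upper | vm_prefix, input_id="toy_pipeline")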

Pre-requisites

Let’s install the validmind library if it’s not already installed, along with the qdrant-client library for our vector store and langchain (plus langchain-openai) for everything else:

%pip install -q validmind
%pip install -q qdrant-client langchain langchain-openai sentencepiece

Initialize the client library

ValidMind generates a unique code snippet for each registered model to connect with your developer environment. You initialize the client library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

Get your code snippet:

  1. In a browser, log in to ValidMind.

  2. In the left sidebar, navigate to Model Inventory and click + Register new model.

  3. Enter the model details and click Continue. (Need more help?)

    For example, to register a model for use with this notebook, select:

    • Documentation template: Gen AI RAG Template
    • Use case: Analytics

    You can fill in other options according to your preference.

  4. Go to Getting Started and click Copy snippet to clipboard.

Next, replace this placeholder with your own code snippet:

# Replace with your code snippet

import validmind as vm

vm.init(
    api_host="https://api.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    model="...",
)

Read OpenAI API Key

We need an OpenAI API key to use the text-embedding-3-small model for our embeddings, the gpt-3.5-turbo model for our generator, and the gpt-4o model for our LLM-as-Judge tests. If you don’t have an OpenAI API key, you can get one by signing up at OpenAI. Then create a .env file in the root of your project containing the key, and the following cell will load it from there. Alternatively, you can uncomment the line below to set the key directly (not recommended, for security reasons).

# load openai api key
import os

import dotenv
import nltk

dotenv.load_dotenv()
nltk.download('stopwords')
nltk.download('punkt_tab')

# os.environ["OPENAI_API_KEY"] = "sk-..."

if not "OPENAI_API_KEY" in os.environ:
    raise ValueError("OPENAI_API_KEY is not set")

Dataset Loader

Great, now that we have our dependencies installed, the library initialized and connected to our model, and our OpenAI API key set up, we can load our datasets. We will use the synthetic RFP dataset included with ValidMind for this notebook. It contains a variety of RFP questions and ground truth answers that serve two purposes: as the source in which our retriever will search for similar question-answer pairs, and as the test set for evaluating the performance of our RAG model. To do this, we just have to load the data and call the preprocess function to split it into train and test sets.

# Import the sample dataset from the library
from validmind.datasets.llm.rag import rfp

raw_df = rfp.load_data()
train_df, test_df = rfp.preprocess(raw_df)
vm_train_ds = vm.init_dataset(
    train_df,
    text_column="question",
    target_column="ground_truth",
)

vm_test_ds = vm.init_dataset(
    test_df,
    text_column="question",
    target_column="ground_truth",
)

vm_test_ds.df.head()

Data validation

Now that we have loaded our dataset, we can run some data validation tests right away to start assessing and documenting the quality of our data. Since we are using a text dataset, we can use ValidMind’s built-in suite of text data quality tests to check for duplicates, missing values, and other common text data issues. We can also run tests to check the sentiment and toxicity of our data.

Duplicates

First, let’s check for duplicates in our dataset. We can use the validmind.data_validation.Duplicates test and pass our dataset:

result = vm.tests.run_test(
    test_id="validmind.data_validation.Duplicates",
    inputs={"dataset": vm_train_ds},
)
result.log()

Stop Words

Next, let’s check for stop words in our dataset. We can use the validmind.data_validation.nlp.StopWords test and pass our dataset:

vm.tests.run_test(
    test_id="validmind.data_validation.nlp.StopWords",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Punctuations

Next, let’s check for punctuation in our dataset. We can use the validmind.data_validation.nlp.Punctuations test:

vm.tests.run_test(
    test_id="validmind.data_validation.nlp.Punctuations",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Common Words

Next, let’s check for common words in our dataset. We can use the validmind.data_validation.nlp.CommonWords test:

vm.tests.run_test(
    test_id="validmind.data_validation.nlp.CommonWords",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Language Detection

For documentation purposes, we can detect and log the languages used in the dataset with the validmind.data_validation.nlp.LanguageDetection test:

vm.tests.run_test(
    test_id="validmind.data_validation.nlp.LanguageDetection",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Toxicity Score

Now, let’s go ahead and run the validmind.data_validation.nlp.Toxicity test to compute a toxicity score for our dataset:

vm.tests.run_test(
    "validmind.data_validation.nlp.Toxicity",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Polarity and Subjectivity

We can also run the validmind.data_validation.nlp.PolarityAndSubjectivity test to compute the polarity and subjectivity of our dataset:

vm.tests.run_test(
    "validmind.data_validation.nlp.PolarityAndSubjectivity",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Sentiment

Finally, we can run the validmind.data_validation.nlp.Sentiment test to plot the sentiment of our dataset:

vm.tests.run_test(
    "validmind.data_validation.nlp.Sentiment",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Embedding Model

Now that we have our dataset loaded and have run some data validation tests to assess and document the quality of our data, we can initialize our embedding model. We will use the text-embedding-3-small model from OpenAI, wrapped in the OpenAIEmbeddings class from LangChain. This model will be used to “embed” our questions, both for inserting the question-answer pairs from the “train” set into the vector store and for embedding incoming questions when making predictions with our RAG model.

from langchain_openai import OpenAIEmbeddings

embedding_client = OpenAIEmbeddings(model="text-embedding-3-small")


def embed(input):
    """Returns a text embedding for the given text"""
    return embedding_client.embed_query(input["question"])


vm_embedder = vm.init_model(input_id="embedding_model", predict_fn=embed)

What we have done here is initialize the OpenAIEmbeddings class so it uses OpenAI’s text-embedding-3-small model. We then created an embed function that takes an input dictionary and uses the embed_query method of the embedding client to compute the embedding of the question. We use a plain function since that is how ValidMind supports any custom model; we will use the same strategy for the retrieval and generator models, but you could also use, say, a HuggingFace model directly. See the ValidMind Documentation for more information on which model types are directly supported.

Finally, we use the init_model function from the ValidMind Library to create a VMModel object that can be used in ValidMind tests. This also logs the model to our model documentation, and any test that uses the model will be linked to the logged model and its metadata.

Assign Predictions

To precompute the embeddings for our test set, we can call the assign_predictions method of the vm_test_ds object we created above. This will compute the embeddings for each question in the test set and store them in a special prediction column that is linked to our vm_embedder model. This will allow us to use these embeddings later when we run tests against our embedding model.

vm_test_ds.assign_predictions(vm_embedder)
print(vm_test_ds)

Run tests

Now that everything is set up for the embedding model, we can run some tests to assess and document the quality of our embeddings. We will use the validmind.model_validation.embeddings.* tests to compute a variety of metrics against our model.

from validmind.tests import run_test

result = run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisRandomNoise",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"probability": 0.3},
).log()
result = run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisSynonyms",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"probability": 0.3},
).log()
result = run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisTranslation",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={
        "source_lang": "en",
        "target_lang": "fr",
    },
).log()
result = run_test(
    "validmind.model_validation.embeddings.CosineSimilarityHeatmap",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
).log()
result = run_test(
    "validmind.model_validation.embeddings.CosineSimilarityDistribution",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
).log()
result = run_test(
    "validmind.model_validation.embeddings.EuclideanDistanceHeatmap",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
).log()
result = run_test(
    "validmind.model_validation.embeddings.PCAComponentsPairwisePlots",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"n_components": 3},
).log()
result = run_test(
    "validmind.model_validation.embeddings.TSNEComponentsPairwisePlots",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"n_components": 3, "perplexity": 20},
).log()

Setup Vector Store

Now that we have assessed our embedding model and verified that it is performing well, we can use it to compute embeddings for the question-answer pairs in the “train” set and insert them into a vector store. We will use an in-memory Qdrant vector database for demo purposes, but any option would work just as well here. We will use the Qdrant vector store class from LangChain to insert documents into, and search against, the vector store.

Generate embeddings for the Train Set

We can use the same assign_predictions method from earlier except this time we will use the vm_train_ds object to compute the embeddings for the question-answer pairs in the “train” set.

vm_train_ds.assign_predictions(vm_embedder)
print(vm_train_ds)

Insert embeddings and questions into Vector DB

Now that we have computed the embeddings for our question-answer pairs in the “train” set, we can go ahead and insert them into the vector store:

from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DataFrameLoader

# load documents from dataframe
loader = DataFrameLoader(train_df, page_content_column="question")
docs = loader.load()
# choose model using embedding client
embedding_client = OpenAIEmbeddings(model="text-embedding-3-small")

# setup vector datastore
qdrant = Qdrant.from_documents(
    docs,
    embedding_client,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="rfp_rag_collection",
)

Retrieval Model

Now that we have an embedding model and a vector database set up and loaded with our data, we need a retrieval model that can search for similar question-answer pairs for a given input question. Once created, we can initialize it as a ValidMind model and assign predictions to it, just like our embedding model.

def retrieve(input):
    """Returns a list of similar question-answer pairs as context strings for the given question"""
    contexts = []

    for result in qdrant.similarity_search_with_score(input["question"]):
        document, score = result
        context = f"Q: {document.page_content}\n"
        context += f"A: {document.metadata['ground_truth']}\n"

        contexts.append(context)

    return contexts


vm_retriever = vm.init_model(input_id="retrieval_model", predict_fn=retrieve)
vm_test_ds.assign_predictions(model=vm_retriever)
print(vm_test_ds)

Generation Model

As the final piece of this simple RAG pipeline, we can create and initialize a generation model that will use the retrieved context to generate an answer to the input question. We will use the gpt-3.5-turbo model from OpenAI.

from openai import OpenAI

from validmind.models import Prompt


system_prompt = """
You are an expert RFP AI assistant.
You are tasked with answering new RFP questions based on existing RFP questions and answers.
You will be provided with the existing RFP questions and answer pairs that are the most relevant to the new RFP question.
After that you will be provided with a new RFP question.
You will generate an answer and respond only with the answer.
Ignore your pre-existing knowledge and answer the question based on the provided context.
""".strip()

openai_client = OpenAI()


def generate(input):
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n\n".join(input["retrieval_model"])},
            {"role": "user", "content": input["question"]},
        ],
    )

    return response.choices[0].message.content


vm_generator = vm.init_model(
    input_id="generation_model",
    predict_fn=generate,
    prompt=Prompt(template=system_prompt),
)

Let’s test it out real quick:

import pandas as pd

vm_generator.predict(
    pd.DataFrame(
        {"retrieval_model": [["My name is anil"]], "question": ["what is my name"]}
    )
)

Prompt Evaluation

Now that we have our generator model initialized, we can run some LLM-as-Judge tests to evaluate the system prompt. This will allow us to get an initial sense of how well the prompt meets a few best practices for prompt engineering. These tests use an LLM to rate the prompt on a scale of 1-10 against the following criteria:

  • Exemplar Bias: When using multi-shot prompting, does the prompt contain an unbiased distribution of examples?
  • Delimitation: When using complex prompts containing examples, contextual information, or other elements, is each element clearly separated?
  • Clarity: How clearly the prompt states the task.
  • Conciseness: How succinctly the prompt states the task.
  • Instruction Framing: Whether the prompt contains negative instructions.
  • Specificity: How specifically the prompt defines the task.

result = run_test(
    "validmind.prompt_validation.Bias",
    inputs={
        "model": vm_generator,
    },
).log()
result = run_test(
    "validmind.prompt_validation.Clarity",
    inputs={
        "model": vm_generator,
    },
).log()
result = run_test(
    "validmind.prompt_validation.Conciseness",
    inputs={
        "model": vm_generator,
    },
).log()
result = run_test(
    "validmind.prompt_validation.Delimitation",
    inputs={
        "model": vm_generator,
    },
).log()
result = run_test(
    "validmind.prompt_validation.NegativeInstruction",
    inputs={
        "model": vm_generator,
    },
).log()
result = run_test(
    "validmind.prompt_validation.Specificity",
    inputs={
        "model": vm_generator,
    },
).log()

Setup RAG Pipeline Model

Now that we have all of our individual “component” models set up and initialized, we need a way to put them together in a single “pipeline”. We can use the PipelineModel class to do this: this ValidMind model type wraps any number of other ValidMind models and runs them in sequence. We chain models together with the pipe (|) operator; in Python this is normally the bitwise or operator, but we have overloaded it for easy pipeline creation. We can then initialize the pipeline model and assign predictions to it just like any other model.

vm_rag_model = vm.init_model(vm_retriever | vm_generator, input_id="rag_model")

We can assign_predictions to the pipeline model just like we did with the individual models. This will run the pipeline on the test set and store the results in the test set for later use.

vm_test_ds.assign_predictions(model=vm_rag_model)
print(vm_test_ds)
vm_test_ds.df.head(5)

Run tests

RAGAS evaluation

Let’s go ahead and run some of our new RAG tests against our model…

Note: these tests are still being developed and are not yet in a stable state. We are using advanced tests here that use LLM-as-Judge and other strategies to assess things like the relevancy of the retrieved context to the input question and the correctness of the generated answer when compared to the ground truth. There is more to come in this area so stay tuned!

import warnings

warnings.filterwarnings("ignore")

Answer Similarity

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.

Measuring the semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.

run_test(
    "validmind.model_validation.ragas.AnswerSimilarity",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
).log()

Context Entity Recall

This test measures the recall of the retrieved context, based on the number of entities present in both the ground truth and the contexts relative to the number of entities present in the ground truth alone. Simply put, it measures what fraction of the ground truth entities are recalled by the retrieved context. This is useful in fact-based use cases like tourism help desks or historical QA, where the retrieval mechanism needs to surface contexts that cover the entities that matter.
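
Written as a formula, the computation described above is (where $GE$ and $CE$ denote the sets of entities extracted from the ground truth and the retrieved contexts, respectively; the entity extraction itself is handled internally by the test):

$$\text{context entity recall} = \frac{|CE \cap GE|}{|GE|}$$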

result = run_test(
    "validmind.model_validation.ragas.ContextEntityRecall",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Context Precision

Context Precision evaluates whether the ground-truth-relevant items present in the contexts are ranked at the top. Ideally, all relevant chunks should appear at the highest ranks. This test is computed using the question, the ground truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
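
As a sketch of the mean-precision computation behind this metric (based on the ragas documentation; here $v_k \in \{0, 1\}$ indicates whether the chunk at rank $k$ is relevant and $K$ is the number of retrieved chunks):

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{total number of relevant items in the top } K}, \qquad \text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k}$$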

result = run_test(
    "validmind.model_validation.ragas.ContextPrecision",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Faithfulness

This test measures the factual consistency of the generated answer against the retrieved context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0, 1) range, with higher values indicating better faithfulness.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims is first extracted from the generated answer. Each claim is then cross-checked against the given context to determine whether it can be inferred from that context.
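
In other words, the score is the fraction of claims that survive this cross-check:

$$\text{faithfulness} = \frac{\text{number of claims in the answer that can be inferred from the given context}}{\text{total number of claims in the answer}}$$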

result = run_test(
    "validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Answer Relevance

The Answer Relevancy test focuses on assessing how pertinent the generated answer is to the given prompt. Lower scores are assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This test is computed using the question, the context, and the answer.

Answer Relevancy is defined as the mean cosine similarity between the original question and a number of artificial questions that were generated (reverse engineered) based on the answer.

Please note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, since cosine similarity ranges from -1 to 1.

Note: This is a reference-free test. If you’re looking to compare the ground truth answer with the generated answer, refer to Answer Correctness.

An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
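
Written out, with $E_{g_i}$ the embedding of the $i$-th generated question, $E_o$ the embedding of the original question, and $N$ the number of generated questions:

$$\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos\!\left(E_{g_i}, E_o\right)$$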

result = run_test(
    "validmind.model_validation.ragas.AnswerRelevance",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Context Recall

Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved context.
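
Expressed as a formula, this sentence-level attribution gives:

$$\text{context recall} = \frac{\text{number of ground truth sentences attributable to the retrieved context}}{\text{total number of sentences in the ground truth}}$$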

result = run_test(
    "validmind.model_validation.ragas.ContextRecall",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Answer Correctness

The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.

Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer. This is done using the concepts of:

  • TP (True Positive): Facts or statements that are present in both the ground truth and the generated answer.
  • FP (False Positive): Facts or statements that are present in the generated answer but not in the ground truth.
  • FN (False Negative): Facts or statements that are present in the ground truth but not in the generated answer.
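
A sketch of how these pieces combine: the factual component is a standard F1 score over the claim counts above, which is then averaged with the semantic similarity score using weights that are configurable in ragas.

$$F1 = \frac{|TP|}{|TP| + 0.5 \times (|FP| + |FN|)}, \qquad \text{answer correctness} = w_1 \times F1 + w_2 \times \text{semantic similarity}$$
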
result = run_test(
    "validmind.model_validation.ragas.AnswerCorrectness",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Aspect Critique

This test assesses submissions against predefined aspects such as harmlessness and correctness, using the generated answer as input. Ragas offers a range of predefined aspects (correctness, harmfulness, etc.), and users can also define their own aspects for evaluating submissions according to their specific criteria. The output of an aspect critique is binary, indicating whether or not the submission aligns with the defined aspect.

result = run_test(
    "validmind.model_validation.ragas.AspectCritique",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Context Utilization

This test is designed to be a reference-free version of Context Precision (which is preferred when a ground truth is available). It works by checking whether the answer-relevant “claims” present in the retrieved contexts are ranked higher than irrelevant chunks of text, which tells us how well the retrieval system finds and ranks the most relevant context.

result = run_test(
    "validmind.model_validation.ragas.ContextUtilization",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Noise Sensitivity

This test evaluates the robustness of the RAG pipeline against noise in the retrieved context. It works by checking how well the “claims” in the generated answer match up with the “claims” in the ground truth answer. If the generated answer contains claims from the contexts that the ground truth answer does not contain, those claims are considered incorrect. The score for each answer is the number of incorrect claims divided by the total number of claims. This can be interpreted as a measure of how sensitive the LLM is to “noise” in the context, where noise is information that is relevant but should not appear in the answer because the ground truth answer does not contain it.
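
Written out, the per-answer score described above is:

$$\text{noise sensitivity} = \frac{\text{number of incorrect claims in the generated answer}}{\text{total number of claims in the generated answer}}$$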

result = run_test(
    "validmind.model_validation.ragas.NoiseSensitivity",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Generation quality

In this section, we evaluate the alignment and relevance of generated responses to reference outputs within our retrieval-augmented generation (RAG) application. We use metrics that assess various quality dimensions of the generated responses, including semantic similarity, structural alignment, and phrasing overlap. Semantic similarity metrics compare embeddings of generated and reference text to capture deeper contextual alignment, while overlap and alignment measures quantify how well the phrasing and structure of generated responses match the intended outputs.

Token Disparity

This test assesses the difference in token counts between the reference texts (ground truth) and the answers generated by the RAG model. It helps evaluate how well the model’s outputs align with the expected length and level of detail in the reference texts. A significant disparity in token counts could signal issues with generation quality, such as excessive verbosity or insufficient detail. Consistently low token counts in generated answers compared to references might suggest that the model’s outputs are incomplete or overly concise, missing important contextual information.

test = vm.tests.run_test(
    "validmind.model_validation.TokenDisparity",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
)
test.log()

ROUGE Score

This test evaluates the quality of answers generated by the RAG model by measuring overlaps in n-grams, word sequences, and word pairs between the model output and the reference (ground truth) text. ROUGE, short for Recall-Oriented Understudy for Gisting Evaluation, assesses both precision and recall, providing a balanced view of how well the generated response captures the reference content. ROUGE precision measures the proportion of n-grams in the generated text that match the reference, highlighting relevance and conciseness, while ROUGE recall assesses the proportion of reference n-grams present in the generated text, indicating completeness and thoroughness.
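
For reference, with rouge-1 (unigrams) as configured below, the precision and recall components follow the standard ROUGE-N definitions:

$$\text{ROUGE-N precision} = \frac{\text{number of overlapping } n\text{-grams}}{\text{number of } n\text{-grams in the generated answer}}, \qquad \text{ROUGE-N recall} = \frac{\text{number of overlapping } n\text{-grams}}{\text{number of } n\text{-grams in the reference}}$$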

Low precision scores might reveal that the generated text includes redundant or irrelevant information, while low recall scores suggest omissions of essential details from the reference. Consistently low ROUGE scores could indicate poor overall alignment with the ground truth, suggesting the model may be missing key content or failing to capture the intended meaning.

test = vm.tests.run_test(
    "validmind.model_validation.RougeScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
    params={
        "metric": "rouge-1",
    },
)
test.log()

BLEU Score

The BLEU Score test evaluates the quality of answers generated by the RAG model by measuring n-gram overlap between the generated text and the reference (ground truth) text, with a specific focus on exact precision in phrasing. While ROUGE precision also assesses overlap, BLEU differs in two main ways: first, it applies a geometric average across multiple n-gram levels, capturing precise phrase alignment, and second, it includes a brevity penalty to prevent overly short outputs from inflating scores artificially. This added precision focus is valuable in RAG applications where strict adherence to reference language is essential, as BLEU emphasizes the match to exact phrasing. In contrast, ROUGE precision evaluates general content overlap without penalizing brevity, offering a broader sense of content alignment.
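
For reference, the standard BLEU formulation behind this description is shown below, with $p_n$ the modified $n$-gram precisions, $w_n$ the weights (typically uniform $1/N$ with $N = 4$), $c$ the length of the generated text, and $r$ the reference length:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$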

test = vm.tests.run_test(
    "validmind.model_validation.BleuScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
)
test.log()

BERT Score

This test evaluates the quality of the RAG generated answers using BERT embeddings to measure precision, recall, and F1 scores based on semantic similarity, rather than exact n-gram matches as in BLEU and ROUGE. This approach captures contextual meaning, making it valuable when wording differs but the intended message closely aligns with the reference. In RAG applications, the BERT score is especially useful for ensuring that generated answers convey the reference text’s meaning, even if phrasing varies. Consistently low scores indicate a lack of semantic alignment, suggesting the model may miss or misrepresent key content. Low precision may reflect irrelevant or redundant details, while low recall can indicate omissions.
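
For reference, BERTScore’s standard recall and precision are greedy matches over contextual token embeddings, with $x$ the reference tokens, $\hat{x}$ the generated tokens, and embeddings compared by cosine similarity; the F1 score is their harmonic mean:

$$R_{\text{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j, \qquad P_{\text{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j$$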

test = vm.tests.run_test(
    "validmind.model_validation.BertScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
)
test.log()

METEOR Score

This test evaluates the quality of the generated answers by measuring alignment with the ground truth, emphasizing both accuracy and fluency. Unlike BLEU and ROUGE, which focus on n-gram matches, METEOR combines precision, recall, synonym matching, and word order, focusing on how well the generated text conveys meaning and reads naturally. This metric is especially useful for RAG applications where sentence structure and natural flow are crucial for clear communication. Lower scores may suggest alignment issues, indicating that the answers may lack fluency or key content. Discrepancies in word order or high fragmentation penalties can reveal problems with how the model constructs sentences, potentially affecting readability.
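
For reference, the original METEOR formulation combines unigram precision $P$ and recall $R$ into a recall-weighted mean and applies a fragmentation penalty based on the number of contiguous matched chunks (exact parameters may vary by implementation):

$$F_{\text{mean}} = \frac{10\,P\,R}{R + 9P}, \qquad \text{penalty} = 0.5 \left(\frac{\text{chunks}}{\text{matched unigrams}}\right)^{3}, \qquad \text{METEOR} = F_{\text{mean}} \times (1 - \text{penalty})$$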

test = vm.tests.run_test(
    "validmind.model_validation.MeteorScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
)
test.log()

Bias and Toxicity

In this section, we use metrics like Toxicity Score and Regard Score to evaluate both the generated responses and the ground truth. These tests help us detect harmful, offensive, or inappropriate language and evaluate the level of bias and neutrality, enabling us to assess and mitigate potential biases in both the model’s responses and the original dataset.

Toxicity Score

This test measures the level of harmful or offensive content in the generated answers. The test uses a preloaded toxicity detection tool from Hugging Face, which identifies language that may be inappropriate, aggressive, or derogatory. High toxicity scores indicate potentially toxic content, while consistently elevated scores across multiple outputs may signal underlying issues in the model’s generation process that require attention to prevent the spread of harmful language.

test = vm.tests.run_test(
    "validmind.model_validation.ToxicityScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
)
test.log()

Regard Score

This test evaluates the sentiment and perceived regard—categorized as positive, negative, neutral, or other—in answers generated by the RAG model. This is important for identifying any biases or sentiment tendencies in responses, ensuring that generated answers are balanced and appropriate for the context. The test uses a preloaded regard evaluation tool from Hugging Face to compute scores for each response. High skewness in regard scores, especially if the generated responses consistently diverge from the expected sentiment in the reference texts, may reveal biases in the model’s generation, such as overly positive or negative tones where neutrality is expected.

test = vm.tests.run_test(
    "validmind.model_validation.RegardScore",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_rag_model,
    },
)
test.log()

Conclusion

In this notebook, we have seen how to use LangChain and ValidMind together to build, evaluate, and document a simple RAG model as it is developed. This is a great example of the interactive development experience that ValidMind is designed to support: we can quickly iterate on our model and document as we go. We have also seen how ValidMind supports non-traditional “models” through a functional interface, and how we can build pipelines of many models to support complex GenAI workflows.

This is still a work in progress and we are actively developing new tests to support more advanced GenAI workflows. We are also keeping an eye on the most popular GenAI models and libraries to explore direct integrations. Stay tuned for more updates and new features in this area!

Upgrade ValidMind

After installing ValidMind, you’ll want to periodically make sure you are on the latest version to access any new features and other enhancements.

Retrieve the information for the currently installed version of ValidMind:

%pip show validmind

If the version returned is lower than the version indicated in our production open-source code, restart your notebook and run:

%pip install --upgrade validmind

You may need to restart your kernel after upgrading the package for the changes to be applied.