NoiseSensitivity

Assesses the sensitivity of a Large Language Model (LLM) to noise in retrieved context by measuring how often it generates incorrect responses.

Purpose

The Noise Sensitivity test aims to measure how sensitive an LLM is to irrelevant or noisy information within the contextual data used to generate its responses. A lower noise sensitivity score suggests better model robustness in generating accurate answers from given contexts.

Test Mechanism

This test evaluates the model’s answers by comparing the claims made in the generated response against the ground truth and the retrieved context. The noise sensitivity score is calculated as:

\[ \text{noise sensitivity} = \frac{|\text{Number of incorrect claims in answer}|}{|\text{Number of total claims in answer}|} \]

The formula computes the fraction of incorrect claims out of the total claims in the answer, using a dataset in which ‘contexts’, ‘answer’, and ‘ground_truth’ columns are specified.
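
For example, an answer containing five claims, two of which contradict the ground truth, scores 2/5 = 0.4. A minimal sketch of the computation (the claim counts are invented for illustration; claim extraction itself is handled by the underlying evaluator):

```python
def noise_sensitivity(incorrect_claims: int, total_claims: int) -> float:
    """Fraction of claims in the answer that are incorrect."""
    if total_claims == 0:
        return 0.0  # no claims to judge; treat as fully robust
    return incorrect_claims / total_claims

assert noise_sensitivity(2, 5) == 0.4
```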

Configuring Columns

This metric requires the following columns in your dataset:

  • contexts (List[str]): The list of text contexts retrieved to generate the answer.
  • answer (str): The response generated by the model.
  • ground_truth (str): The “correct” answer to the question.
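
For illustration, a minimal pandas sketch of the expected dataset shape (the rows are invented examples):

```python
import pandas as pd

# One row per question; column names match the defaults listed above.
df = pd.DataFrame({
    "contexts": [["Paris is the capital of France.", "France uses the euro."]],
    "answer": ["The capital of France is Paris."],
    "ground_truth": ["Paris is the capital of France."],
})
```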

If your dataset stores this data under different column names, you can map them using the contexts_column and answer_column parameters.

For example, if your dataset has this data stored in different columns, you can pass the following parameters:

```python
params = {
    "contexts_column": "context_info",
    "answer_column": "my_answer_col",
}
```
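
These parameters are then passed when running the test. As a rough sketch (the test ID and the vm_dataset variable here are assumptions, not taken from this page):

```python
import validmind as vm

# Hypothetical invocation; adjust the test ID and inputs to your setup.
vm.tests.run_test(
    "validmind.model_validation.ragas.NoiseSensitivity",
    inputs={"dataset": vm_dataset},
    params={
        "contexts_column": "context_info",
        "answer_column": "my_answer_col",
    },
)
```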

If the data is stored as a dictionary in another column, specify the column and key like this:

```python
pred_col = dataset.prediction_column(model)
params = {
    "contexts_column": f"{pred_col}.contexts",
    "answer_column": f"{pred_col}.answer",
}
```

For more complex situations, you can use a function to extract the data:

```python
pred_col = dataset.prediction_column(model)
params = {
    "contexts_column": lambda row: [row[pred_col]["context_message"]],
    "answer_column": lambda row: "\n\n".join(row[pred_col]["messages"]),
}
```

Signs of High Risk

  • High noise sensitivity scores across multiple samples.
  • Significant deviation between mean and median noise sensitivity scores.
  • High standard deviation indicating inconsistency in the model’s performance.
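
As a rough illustration of how these signals could be checked, assuming per-sample scores are available (the scores and thresholds below are invented for the example):

```python
import numpy as np

# Hypothetical per-sample noise sensitivity scores, one per dataset row.
scores = np.array([0.10, 0.15, 0.80, 0.12, 0.75, 0.11])

mean, median, std = scores.mean(), np.median(scores), scores.std()
print(f"mean={mean:.2f}  median={median:.2f}  std={std:.2f}")

# A large mean-median gap or a high spread suggests inconsistent robustness.
if abs(mean - median) > 0.2 or std > 0.25:
    print("Warning: inconsistent noise sensitivity across samples")
```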

Strengths

  • Provides a quantitative measure of how well the LLM handles noisy or irrelevant context.
  • Easy integration and configuration using column parameters.
  • Utilizes both histogram and box plot visualizations to analyze score distribution.

Limitations

  • Requires accurate ground truth that aligns with the generated answers.
  • Assumes the context provided is sufficiently granular to assess noise sensitivity.
  • Primarily applicable to tasks like text QA, text generation, and text summarization where contextual relevance is critical.