ValidMind + Databricks Quickstart

Use this notebook to install and run the ValidMind Library inside a Databricks Collaborative Notebook, load data from a Unity Catalog table linked to your model in ValidMind, train a simple classification model, and send the results to the ValidMind Platform.

In this notebook, you will:

Install and initialize the ValidMind Library
Load data from a Unity Catalog table linked to your model in ValidMind
Train a simple classification model
Run ValidMind tests and send the results to the ValidMind Platform

Before you begin

You will need:

A running Databricks workspace with Unity Catalog enabled
A ValidMind account with a registered model
Your ValidMind API credentials (API key, API secret, model identifier)

To get your credentials: log in to ValidMind → Model Inventory → select your model → Getting Started → Copy snippet to clipboard.

For step-by-step instructions on setting up the Databricks integration and linking a Unity Catalog table to your model, refer to Synchronize with Databricks.

Note: If you don't have a Unity Catalog table linked to your model yet, this notebook includes a synthetic-data fallback so you can still run through the full workflow.

Step 1 — Install the ValidMind Library

Run this cell first. Databricks requires a Python restart after %pip install.

%pip install -q validmind

# Restart Python kernel to pick up newly installed packages
dbutils.library.restartPython()

Step 2 — Verify installation

Confirm that the ValidMind Library installed successfully and check the version available in your notebook environment:

import importlib.metadata
version = importlib.metadata.version('validmind')
print(f'ValidMind Library version: {version}')
print('Installation successful!')

Step 3 — Initialize the ValidMind Library

Initialize the ValidMind Library with the code snippet unique to your model so that test results are uploaded to the correct model in the ValidMind Platform.

You can supply your credentials in either of two ways:

Databricks widgets: set widgets named vm_api_host, vm_api_key, vm_api_secret, and vm_model_cuid on the notebook. This is convenient when you parameterize the notebook as part of a Databricks job.
Edit the next cell directly: replace the placeholder values with your own credentials.
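
If you take the widget route, the four widgets have to exist on the notebook before the initialization cell reads them. The following is a minimal sketch, assuming the standard `dbutils.widgets.text(name, defaultValue, label)` call available in Databricks notebooks. The helper name `create_credential_widgets` and the widget labels are illustrative and not part of ValidMind or Databricks:

```python
# Widget names expected by the initialization cell, mapped to display labels.
# The labels are illustrative; only the names matter to the notebook.
CREDENTIAL_WIDGETS = {
    "vm_api_host": "ValidMind API host",
    "vm_api_key": "ValidMind API key",
    "vm_api_secret": "ValidMind API secret",
    "vm_model_cuid": "ValidMind model CUID",
}


def create_credential_widgets(create_text_widget):
    """Create one empty text widget per credential (hypothetical helper).

    `create_text_widget` is a callable with the same positional arguments as
    `dbutils.widgets.text`: widget name, default value, label.
    Returns the widget names in the order they were created.
    """
    created = []
    for name, label in CREDENTIAL_WIDGETS.items():
        create_text_widget(name, "", label)
        created.append(name)
    return created


# In a Databricks notebook you would call:
#   create_credential_widgets(dbutils.widgets.text)
```

Because the widgets are created empty, fill them in (or pass values as job parameters) before running the initialization cell. An existing widget with an empty value takes precedence over the placeholder default.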

To get your credentials:

In ValidMind, go to Model Inventory and select your model.
Open Getting Started and click Copy snippet to clipboard.
Paste the values into the next cell, or use them to set the corresponding widgets:

import validmind as vm

# ---------------------------------------------------------------------------
# Credentials are read from Databricks widgets if set. Otherwise, replace the
# placeholder values below before running this cell.
# ---------------------------------------------------------------------------
try:
    _widgets   = dbutils.widgets.getAll()  # read all widget values once
    api_host   = _widgets.get("vm_api_host", "<YOUR_API_HOST>")
    api_key    = _widgets.get("vm_api_key", "<YOUR_API_KEY>")
    api_secret = _widgets.get("vm_api_secret", "<YOUR_API_SECRET>")
    model_cuid = _widgets.get("vm_model_cuid", "<YOUR_MODEL_CUID>")
except NameError:
    # dbutils is not available — running outside Databricks
    api_host   = "<YOUR_API_HOST>" # replace with your API host
    api_key    = "<YOUR_API_KEY>" # replace with your API key
    api_secret = "<YOUR_API_SECRET>" # replace with your API secret
    model_cuid = "<YOUR_MODEL_CUID>" # replace with your model CUID

vm.init(
    api_host=api_host,
    api_key=api_key,
    api_secret=api_secret,
    model=model_cuid,
)
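
A common slip is running the initialization cell with a placeholder still in place. The sketch below shows one way to catch that before calling `vm.init()`. The helper `find_unset_credentials` is hypothetical, and the host and identifier values in the example are made up:

```python
def find_unset_credentials(credentials):
    """Return the names of credentials that are empty or still placeholders.

    Hypothetical helper: a value counts as unset when it is blank or still
    looks like one of this notebook's "<YOUR_...>" placeholders.
    """
    unset = []
    for name, value in credentials.items():
        text = (value or "").strip()
        if not text or (text.startswith("<YOUR_") and text.endswith(">")):
            unset.append(name)
    return unset


example = {
    "api_host": "https://validmind.example.invalid/api",  # made-up host
    "api_key": "<YOUR_API_KEY>",                          # placeholder left in
    "api_secret": "",                                     # empty widget value
    "model_cuid": "abc123",                               # made-up identifier
}
print(find_unset_credentials(example))  # ['api_key', 'api_secret']
```

In this notebook you could pass the four variables from the cell above as a dictionary and raise an error if the returned list is not empty.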

Step 4 — Load data from your linked Databricks table

Load the data for this notebook from a Unity Catalog table that you've linked to your model in ValidMind. Once a table binding is set up, ValidMind syncs the data and makes it available through the tracking API. You don't need a Spark session or direct Unity Catalog credentials in this notebook.

Before running the next cell, make sure you have:

A Databricks integration configured in Settings → Integrations → Databricks
A table binding created for your model that links a Unity Catalog table to it
At least one successful sync (the initial sync runs automatically when you create the binding)

If you don't have a table binding yet, set USE_SYNTHETIC_FALLBACK = True in the next cell to run this notebook with generated data instead.

import requests
import pandas as pd
from validmind import api_client as _vm_client

# Set to True only if you don't have a Databricks table binding set up yet
USE_SYNTHETIC_FALLBACK = False

# ---------------------------------------------------------------------------
# Load from ValidMind — uses the linked Databricks table binding for this model
# ---------------------------------------------------------------------------
if not USE_SYNTHETIC_FALLBACK:
    _api_host = _vm_client.get_api_host()  # same host as vm.init()
    _headers  = _vm_client._get_api_headers()

    _response = requests.get(
        f"{_api_host}/integrations/dataset",
        headers=_headers,
        timeout=30,
    )

    if _response.status_code == 200:
        _data = _response.json()
        TABLE_NAME    = _data.get("table_name", "unknown")
        TARGET_COLUMN = "target"  # <-- update if your table uses a different column name
        row_data      = _data.get("row_data", [])

        if not row_data:
            raise RuntimeError(
                f"Binding found for table '{TABLE_NAME}' but row_data is empty. "
                "The sync may still be in progress — wait a moment and re-run this cell."
            )

        df = pd.DataFrame(row_data)

        if TARGET_COLUMN not in df.columns:
            raise ValueError(
                f"Column '{TARGET_COLUMN}' not found in synced data. "
                f"Available columns: {list(df.columns)}. "
                "Update TARGET_COLUMN above to match your table's target column."
            )

        print(f"Loaded {len(df):,} rows, {len(df.columns)} columns from {TABLE_NAME}")
        print(f"Last synced: {_data.get('last_synced_at', 'unknown')}")
        print(f"Target distribution: {df[TARGET_COLUMN].value_counts().to_dict()}")
        display(df.head())

    elif _response.status_code == 404:
        raise RuntimeError(
            "No active Databricks table binding found for this model.\n\n"
            "To fix:\n"
            "  1. Go to ValidMind → Settings → Integrations → Databricks\n"
            "  2. Open the model binding browser and select a Unity Catalog table\n"
            "  3. Wait ~30 seconds for the initial sync to complete\n"
            "  4. Re-run this cell\n\n"
            "Or set USE_SYNTHETIC_FALLBACK = True above to continue with generated data."
        )
    else:
        raise RuntimeError(
            f"Unexpected error loading dataset from ValidMind: "
            f"{_response.status_code} — {_response.text}"
        )

# ---------------------------------------------------------------------------
# Synthetic data fallback — runs when USE_SYNTHETIC_FALLBACK = True
# Feature names follow the Bank Customer Churn example from ValidMind; the
# values are randomly generated by make_classification, not real customer data
# ---------------------------------------------------------------------------
if USE_SYNTHETIC_FALLBACK:
    import numpy as np
    from sklearn.datasets import make_classification

    np.random.seed(42)
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=6,
        n_redundant=2,
        random_state=42,
    )
    feature_names = [
        "credit_score", "age", "tenure", "balance",
        "num_products", "has_credit_card", "is_active_member",
        "estimated_salary", "geography_encoded", "gender_encoded",
    ]
    df = pd.DataFrame(X, columns=feature_names)
    df["target"] = y
    TARGET_COLUMN = "target"
    TABLE_NAME    = "synthetic"

    print(f"Using synthetic dataset: {len(df):,} rows, {len(df.columns)} columns")
    print(f"Target distribution: {df[TARGET_COLUMN].value_counts().to_dict()}")
    display(df.head())
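
When `row_data` comes back empty, the error message in this step suggests waiting and re-running the cell by hand. A wait-and-retry loop is one alternative. This is a minimal sketch under stated assumptions: `fetch_rows` is a hypothetical zero-argument callable that wraps the request from this step and returns the `row_data` list, and the attempt count and delay are illustrative defaults rather than values documented by ValidMind:

```python
import time


def wait_for_rows(fetch_rows, attempts=5, delay_seconds=10.0, sleep=time.sleep):
    """Call `fetch_rows` until it returns a non-empty list or attempts run out.

    `fetch_rows` is a hypothetical zero-argument callable returning the
    `row_data` list from the dataset response. `sleep` is injectable so the
    loop can be exercised without real waiting.
    Raises RuntimeError if every attempt comes back empty.
    """
    for attempt in range(1, attempts + 1):
        rows = fetch_rows()
        if rows:
            return rows
        if attempt < attempts:
            # Give the sync time to finish before asking again.
            sleep(delay_seconds)
    raise RuntimeError(
        f"No rows after {attempts} attempts; the sync may not have completed."
    )
```

Keeping the total wait short is deliberate: if the sync has not produced rows after a minute or so, checking the binding in the ValidMind Platform is likely more useful than waiting longer.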

Step 5 — Prepare train/test split

Split the dataset into a training set and a test set so you can train the model on one slice of the data and evaluate how it generalizes on data it hasn't seen:

from sklearn.model_selection import train_test_split

feature_columns = [c for c in df.columns if c != TARGET_COLUMN]

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f'Train set: {len(train_df):,} rows')
print(f'Test set:  {len(test_df):,} rows')
print(f'Features:  {feature_columns}')
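
The split in this step is not stratified, so the class mix in the two slices can drift apart, especially for small or imbalanced tables. A quick way to check is to compare class shares in each slice. The sketch below uses only the standard library; `class_shares` is an illustrative helper, and the label lists are made up to stand in for `train_df[TARGET_COLUMN]` and `test_df[TARGET_COLUMN]`:

```python
from collections import Counter


def class_shares(labels):
    """Return each class's share of the labels as a fraction of the total."""
    labels = list(labels)
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Made-up labels for illustration only.
train_labels = [0, 0, 0, 1, 1, 0, 1, 0]
test_labels = [0, 1]

print(class_shares(train_labels))  # {0: 0.625, 1: 0.375}
print(class_shares(test_labels))   # {0: 0.5, 1: 0.5}
```

If the shares differ noticeably, passing `stratify=df[TARGET_COLUMN]` to `train_test_split` keeps the class proportions aligned between the two sets.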

Step 6 — Train a simple model

Train a gradient boosting classifier on the training set. This is a small, fast model that's well suited to a quickstart. The goal here is to produce something you can document end-to-end, not to tune for accuracy.

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(train_df[feature_columns], train_df[TARGET_COLUMN])

train_accuracy = model.score(train_df[feature_columns], train_df[TARGET_COLUMN])
test_accuracy  = model.score(test_df[feature_columns],  test_df[TARGET_COLUMN])

print(f'Train accuracy: {train_accuracy:.4f}')
print(f'Test accuracy:  {test_accuracy:.4f}')
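
For a scikit-learn classifier, `model.score()` reports mean accuracy: the fraction of rows where the predicted label equals the true label. The plain-Python sketch below spells that out; the function name `accuracy` and the labels are illustrative:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


# Made-up labels: 4 of the 5 predictions match.
print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 0.8
```

A train accuracy far above the test accuracy usually points to overfitting, which is worth noting even in a quickstart model.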

Step 7 — Register datasets and model with ValidMind

Before you can run tests, ValidMind needs to know about your datasets and your model. Wrap the training and test DataFrames with vm.init_dataset() and the trained classifier with vm.init_model(). Each call returns a ValidMind object that the test functions accept as input.

The input_id you pass identifies each input when results are sent to the ValidMind Platform.

vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=TARGET_COLUMN,
)

vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset",
    target_column=TARGET_COLUMN,
)

vm_model = vm.init_model(
    model=model,
    input_id="gradient_boosting_model",
)

print('Datasets and model registered with ValidMind.')

Step 8 — Assign predictions to datasets

Many tests compare predicted values against actual values, so ValidMind needs the model's predictions attached to each dataset. The assign_predictions() method computes predictions from your model and links them to the dataset object, once for the training set and once for the test set:

vm_train_ds.assign_predictions(model=vm_model)
vm_test_ds.assign_predictions(model=vm_model)

print('Predictions assigned.')
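
To make "comparing predicted values against actual values" concrete, here is a minimal sketch that tallies the four outcomes of a binary classifier by hand. It illustrates the idea only and is not how ValidMind computes its tests internally; the helper name `confusion_counts` and the labels are made up:

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    # Return a plain dict with all four keys, including any that stayed at zero.
    return {key: counts[key] for key in ("tp", "fp", "fn", "tn")}


# Made-up labels for illustration only.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))  # {'tp': 2, 'fp': 1, 'fn': 1, 'tn': 2}
```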

Step 9 — Run individual tests

Run a few individual tests against your registered datasets and model to get familiar with how ValidMind tests work before running the full suite. Each vm.tests.run_test() call executes one test and renders the result inline in this notebook. Calling result.log() then sends that result to the ValidMind Platform:

# Dataset statistics — validates data documentation capability
result = vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={"dataset": vm_train_ds},
)
result.log()

# Class imbalance check
result = vm.tests.run_test(
    "validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_train_ds},
)
result.log()

# Confusion matrix — validates model performance visualization
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ConfusionMatrix",
    inputs={"dataset": vm_test_ds, "model": vm_model},
)
result.log()

# ROC curve
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ROCCurve",
    inputs={"dataset": vm_test_ds, "model": vm_model},
)
result.log()

# Feature importance
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.FeatureImportance",
    inputs={"dataset": vm_train_ds, "model": vm_model},
)
result.log()
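
The cells in this step repeat the same run-then-log pattern, so they can be collapsed into a loop over a list of test IDs and inputs. This is a sketch, not part of the ValidMind Library: `run_and_log` is a hypothetical helper, and it assumes only what this step already shows, namely that the runner accepts a test ID plus an `inputs` keyword and returns an object with a `log()` method:

```python
def run_and_log(run_test, plan):
    """Run each (test_id, inputs) pair and log the result (hypothetical helper).

    `run_test` is expected to behave like `vm.tests.run_test` as used in this
    step. Returns the test IDs in the order they were logged.
    """
    logged = []
    for test_id, inputs in plan:
        result = run_test(test_id, inputs=inputs)
        result.log()
        logged.append(test_id)
    return logged


# In this notebook the call might look like:
#   plan = [
#       ("validmind.data_validation.DatasetDescription", {"dataset": vm_train_ds}),
#       ("validmind.data_validation.ClassImbalance", {"dataset": vm_train_ds}),
#       ("validmind.model_validation.sklearn.ConfusionMatrix",
#        {"dataset": vm_test_ds, "model": vm_model}),
#   ]
#   run_and_log(vm.tests.run_test, plan)
```

The loop stops at the first test that raises, so earlier results are already logged by the time a failure surfaces.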

Step 10 — Run the full test suite

Run the complete classifier documentation suite. This single call executes every test in the suite and sends all results to the ValidMind Platform, where they populate the corresponding sections of your model documentation:

test_suite_result = vm.run_test_suite(
    "classifier_full_suite",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
        "train_dataset": vm_train_ds,
        "test_dataset": vm_test_ds,
    },
)
print('Full test suite completed and results sent to ValidMind Platform.')

Step 11 — Verify results on the platform

To see the results of this notebook in the ValidMind Platform:

Go to the ValidMind Platform (or your ValidMind instance).
Navigate to Model Inventory and select your model.
Open the Documentation tab.
Confirm that the test results from this notebook appear in the relevant sections.

After a successful run, you should see the following results in your model's documentation:

Dataset Description table
Class Imbalance chart
Confusion Matrix
ROC Curve
Feature Importance chart
Full classifier suite results

Troubleshooting

If you run into any of the issues below, the table lists the likely fix:

Issue	Fix
`ModuleNotFoundError` after install	Re-run the `dbutils.library.restartPython()` cell.
`ConnectionError` on `vm.init()`	Your workspace may block outbound traffic. Check your network policy, or use a cluster with internet access.
`401 Unauthorized` on `vm.init()`	The API key or secret is incorrect. Copy your credentials again from the ValidMind Platform.
`numpy` version conflict	Pin a compatible version with `%pip install -q validmind "numpy>=1.23,<2.0.0"`.
`404` on dataset load	No Databricks table binding was found. Create one in Settings → Integrations → Databricks, then wait for the initial sync to complete.
`row_data is empty` after binding created	The initial sync is still running. Wait about 30 seconds and re-run Step 4.
Wrong columns or target not found	Update `TARGET_COLUMN` in Step 4 to match the target column in your Unity Catalog table.
Want to try the notebook without a binding	Set `USE_SYNTHETIC_FALLBACK = True` in Step 4 to use generated data.
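
To tell whether the `numpy` row applies to your cluster, you can compare the installed version against the pin from the table. The sketch below checks only the major and minor components, which is enough for the `>=1.23,<2.0.0` range; `within_numpy_pin` is an illustrative helper, not a ValidMind or NumPy function:

```python
import re


def within_numpy_pin(version):
    """Return True if `version` falls inside numpy>=1.23,<2.0.0.

    Only the leading "major.minor" part of the version string is inspected.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (1, 23) <= (major, minor) < (2, 0)


print(within_numpy_pin("1.26.4"))  # True
print(within_numpy_pin("2.1.0"))   # False
print(within_numpy_pin("1.22.4"))  # False

# In the notebook, the installed version can be read the same way Step 2
# reads the validmind version:
#   import importlib.metadata
#   within_numpy_pin(importlib.metadata.version("numpy"))
```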