ValidMind + Databricks Quickstart
Use this notebook to install and run the ValidMind Library inside a Databricks Collaborative Notebook, load data from a Unity Catalog table linked to your model in ValidMind, train a simple classification model, and send the results to the ValidMind Platform.
In this notebook, you will:
- Install and initialize the ValidMind Library
- Load data from a Unity Catalog table linked to your model in ValidMind
- Train a simple classification model
- Run ValidMind tests and send the results to the ValidMind Platform
Before you begin
You will need:
1. A running Databricks workspace with Unity Catalog enabled
2. A ValidMind account with a registered model
3. Your ValidMind API credentials (API key, API secret, model identifier)
To get your credentials: log in to ValidMind → Model Inventory → select your model → Getting Started → Copy snippet to clipboard.
For step-by-step instructions on setting up the Databricks integration and linking a Unity Catalog table to your model, refer to Synchronize with Databricks.
Note: If you don't have a Unity Catalog table linked to your model yet, this notebook includes a synthetic-data fallback so you can still run through the full workflow.
Step 1 — Install the ValidMind Library
Run this cell first. Databricks requires a Python restart after %pip install.
%pip install -q validmind

# Restart Python kernel to pick up newly installed packages
dbutils.library.restartPython()

Step 2 — Verify installation
Confirm that the ValidMind Library installed successfully and check the version available in your notebook environment:
import importlib.metadata
version = importlib.metadata.version('validmind')
print(f'ValidMind Library version: {version}')
print('Installation successful!')

Step 3 — Initialize the ValidMind Library
Initialize the ValidMind Library with the code snippet unique to your model so that test results are uploaded to the correct model in the ValidMind Platform.
You can supply your credentials in either of two ways:
- Databricks widgets: set widgets named vm_api_host, vm_api_key, vm_api_secret, and vm_model_cuid on the notebook. This is convenient when you parameterize the notebook as part of a Databricks job.
- Edit the next cell directly: replace the placeholder values with your own credentials.
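Whichever option you choose, it can help to confirm that no placeholder is left in place before the library is initialized. The following is a minimal sketch of that check, not part of the ValidMind Library or Databricks dbutils: resolve_credentials is a hypothetical helper, and it assumes unset values keep the <YOUR_...> placeholder form used in this notebook.

```python
# Hypothetical helper: not part of the ValidMind Library or Databricks dbutils.
PLACEHOLDERS = {
    "vm_api_host": "<YOUR_API_HOST>",
    "vm_api_key": "<YOUR_API_KEY>",
    "vm_api_secret": "<YOUR_API_SECRET>",
    "vm_model_cuid": "<YOUR_MODEL_CUID>",
}


def resolve_credentials(widgets):
    """Merge widget values over the placeholders and reject unset entries.

    `widgets` is any mapping of widget name to value, for example the
    dictionary returned by dbutils.widgets.getAll() on Databricks.
    """
    # Empty or missing widget values fall back to the placeholder.
    resolved = {
        name: widgets.get(name) or default
        for name, default in PLACEHOLDERS.items()
    }
    # Anything still equal to its placeholder was never filled in.
    missing = [
        name for name, value in resolved.items()
        if value == PLACEHOLDERS[name]
    ]
    if missing:
        raise ValueError(f"Credentials still unset: {', '.join(sorted(missing))}")
    return resolved
```

Failing early with a list of the unset names is usually easier to act on than an authentication error raised later by the platform.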
To get your credentials:
- In ValidMind, go to Model Inventory and select your model.
- Open Getting Started and click Copy snippet to clipboard.
- Paste the values into the next cell, or use them to set the corresponding widgets:
import validmind as vm

# ---------------------------------------------------------------------------
# Credentials are read from Databricks widgets if set. Otherwise, replace the
# placeholder values below before running this cell.
# ---------------------------------------------------------------------------
try:
    # Read all widget values once; unset widgets fall back to the placeholders
    _widgets = dbutils.widgets.getAll()
    api_host = _widgets.get("vm_api_host", "<YOUR_API_HOST>")
    api_key = _widgets.get("vm_api_key", "<YOUR_API_KEY>")
    api_secret = _widgets.get("vm_api_secret", "<YOUR_API_SECRET>")
    model_cuid = _widgets.get("vm_model_cuid", "<YOUR_MODEL_CUID>")
except NameError:
    # dbutils is not available — running outside Databricks
    api_host = "<YOUR_API_HOST>"  # replace with your API host
    api_key = "<YOUR_API_KEY>"  # replace with your API key
    api_secret = "<YOUR_API_SECRET>"  # replace with your API secret
    model_cuid = "<YOUR_MODEL_CUID>"  # replace with your model CUID

vm.init(
    api_host=api_host,
    api_key=api_key,
    api_secret=api_secret,
    model=model_cuid,
)

Step 4 — Load data from your linked Databricks table
Load the data for this notebook from a Unity Catalog table that you've linked to your model in ValidMind. Once a table binding is set up, ValidMind syncs the data and makes it available through the tracking API. You don't need a Spark session or direct Unity Catalog credentials in this notebook.
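To make the shape of the synced data concrete, here is a rough sketch of the checks the loading cell applies to the response. The field names (table_name, row_data, last_synced_at) are the ones this notebook reads. The payload values, the assumption that row_data is a list of one dictionary per row, and the check_synced_payload helper are all made up for illustration.

```python
# Hypothetical sketch: the field names mirror the ones this notebook reads
# from the response; the payload itself is invented for illustration.
def check_synced_payload(payload, target_column="target"):
    """Return (table_name, rows) or raise if the payload is unusable."""
    table_name = payload.get("table_name", "unknown")
    rows = payload.get("row_data", [])
    if not rows:
        # An empty row list usually means the sync has not finished yet.
        raise RuntimeError(f"No rows synced yet for table '{table_name}'.")
    # Collect every column name that appears in at least one row.
    columns = set().union(*(row.keys() for row in rows))
    if target_column not in columns:
        raise ValueError(
            f"Column '{target_column}' not found; available: {sorted(columns)}"
        )
    return table_name, rows


example_payload = {
    "table_name": "main.demo.churn",  # made-up Unity Catalog table name
    "last_synced_at": "2025-01-01T00:00:00Z",  # made-up timestamp
    "row_data": [
        {"age": 41, "balance": 1200.0, "target": 0},
        {"age": 29, "balance": 310.5, "target": 1},
    ],
}
table_name, rows = check_synced_payload(example_payload)
print(f"{len(rows)} rows from {table_name}")
```

A list of per-row dictionaries like this is also a shape that pd.DataFrame() accepts directly.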
Before running the next cell, make sure you have:
- A Databricks integration configured in Settings → Integrations → Databricks
- A table binding created for your model that links a Unity Catalog table to it
- At least one successful sync (the initial sync runs automatically when you create the binding)
If you don't have a table binding yet, set USE_SYNTHETIC_FALLBACK = True in the next cell to run this notebook with generated data instead.
import requests
import pandas as pd
from validmind import api_client as _vm_client

# Set to True only if you don't have a Databricks table binding set up yet
USE_SYNTHETIC_FALLBACK = False

# ---------------------------------------------------------------------------
# Load from ValidMind — uses the linked Databricks table binding for this model
# ---------------------------------------------------------------------------
if not USE_SYNTHETIC_FALLBACK:
    _api_host = _vm_client.get_api_host()  # same host as vm.init()
    _headers = _vm_client._get_api_headers()
    _response = requests.get(
        f"{_api_host}/integrations/dataset",
        headers=_headers,
        timeout=30,
    )
    if _response.status_code == 200:
        _data = _response.json()
        TABLE_NAME = _data.get("table_name", "unknown")
        TARGET_COLUMN = "target"  # <-- update if your table uses a different column name
        row_data = _data.get("row_data", [])
        if not row_data:
            raise RuntimeError(
                f"Binding found for table '{TABLE_NAME}' but row_data is empty. "
                "The sync may still be in progress — wait a moment and re-run this cell."
            )
        df = pd.DataFrame(row_data)
        if TARGET_COLUMN not in df.columns:
            raise ValueError(
                f"Column '{TARGET_COLUMN}' not found in synced data. "
                f"Available columns: {list(df.columns)}. "
                "Update TARGET_COLUMN above to match your table's target column."
            )
        print(f"Loaded {len(df):,} rows, {len(df.columns)} columns from {TABLE_NAME}")
        print(f"Last synced: {_data.get('last_synced_at', 'unknown')}")
        print(f"Target distribution: {df[TARGET_COLUMN].value_counts().to_dict()}")
        display(df.head())
    elif _response.status_code == 404:
        raise RuntimeError(
            "No active Databricks table binding found for this model.\n\n"
            "To fix:\n"
            "  1. Go to ValidMind → Settings → Integrations → Databricks\n"
            "  2. Open the model binding browser and select a Unity Catalog table\n"
            "  3. Wait ~30 seconds for the initial sync to complete\n"
            "  4. Re-run this cell\n\n"
            "Or set USE_SYNTHETIC_FALLBACK = True above to continue with generated data."
        )
    else:
        raise RuntimeError(
            f"Unexpected error loading dataset from ValidMind: "
            f"{_response.status_code} — {_response.text}"
        )

# ---------------------------------------------------------------------------
# Synthetic data fallback — runs when USE_SYNTHETIC_FALLBACK = True
# Uses the Bank Customer Churn dataset pattern from ValidMind examples
# ---------------------------------------------------------------------------
if USE_SYNTHETIC_FALLBACK:
    import numpy as np
    from sklearn.datasets import make_classification

    np.random.seed(42)
    X, y = make_classification(
        n_samples=1000,
        n_features=10,
        n_informative=6,
        n_redundant=2,
        random_state=42,
    )
    feature_names = [
        "credit_score", "age", "tenure", "balance",
        "num_products", "has_credit_card", "is_active_member",
        "estimated_salary", "geography_encoded", "gender_encoded",
    ]
    df = pd.DataFrame(X, columns=feature_names)
    df["target"] = y
    TARGET_COLUMN = "target"
    TABLE_NAME = "synthetic"
    print(f"Using synthetic dataset: {len(df):,} rows, {len(df.columns)} columns")
    print(f"Target distribution: {df[TARGET_COLUMN].value_counts().to_dict()}")
    display(df.head())

Step 5 — Prepare train/test split
Split the dataset into a training set and a test set so you can train the model on one slice of the data and evaluate how it generalizes on data it hasn't seen:
from sklearn.model_selection import train_test_split
feature_columns = [c for c in df.columns if c != TARGET_COLUMN]
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(f'Train set: {len(train_df):,} rows')
print(f'Test set: {len(test_df):,} rows')
print(f'Features: {feature_columns}')

Step 6 — Train a simple model
Train a gradient boosting classifier on the training set. This is a small, fast model that's well-suited to a quickstart. The goal here is to produce something documentable end-to-end, not to tune for accuracy.
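Even without tuning, a reference point helps when reading the accuracy figures the training cell prints. One common reference is the accuracy of always predicting the most frequent class. The sketch below computes it with the standard library only; majority_class_accuracy and the example labels are illustrative and are not part of the notebook or the ValidMind Library.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Made-up labels for illustration: 7 negatives and 3 positives.
example_labels = [0] * 7 + [1] * 3
print(f"Majority-class baseline: {majority_class_accuracy(example_labels):.2f}")
```

In this notebook you could pass test_df[TARGET_COLUMN].tolist() instead of the example labels. A test accuracy close to this baseline suggests the model has learned little beyond the class balance.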
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(train_df[feature_columns], train_df[TARGET_COLUMN])
train_accuracy = model.score(train_df[feature_columns], train_df[TARGET_COLUMN])
test_accuracy = model.score(test_df[feature_columns], test_df[TARGET_COLUMN])
print(f'Train accuracy: {train_accuracy:.4f}')
print(f'Test accuracy: {test_accuracy:.4f}')

Step 7 — Register datasets and model with ValidMind
Before you can run tests, ValidMind needs to know about your datasets and your model. Wrap the training and test DataFrames with vm.init_dataset() and the trained classifier with vm.init_model(). Each call returns a ValidMind object that the test functions accept as input.
The input_id you pass identifies each input when results are sent to the ValidMind Platform.
vm_train_ds = vm.init_dataset(
    dataset=train_df,
    input_id="train_dataset",
    target_column=TARGET_COLUMN,
)
vm_test_ds = vm.init_dataset(
    dataset=test_df,
    input_id="test_dataset",
    target_column=TARGET_COLUMN,
)
vm_model = vm.init_model(
    model=model,
    input_id="gradient_boosting_model",
)
print('Datasets and model registered with ValidMind.')

Step 8 — Assign predictions to datasets
Many tests compare predicted values against actual values, so ValidMind needs the model's predictions attached to each dataset. The assign_predictions() method computes predictions from your model and links them to the dataset object, once for the training set and once for the test set:
vm_train_ds.assign_predictions(model=vm_model)
vm_test_ds.assign_predictions(model=vm_model)
print('Predictions assigned.')

Step 9 — Run individual tests
Run a few individual tests against your registered datasets and model to get familiar with how ValidMind tests work before running the full suite. Each vm.tests.run_test() call executes one test and renders the result inline in this notebook; calling result.log() then sends that result to the ValidMind Platform:
# Dataset statistics — validates data documentation capability
result = vm.tests.run_test(
    "validmind.data_validation.DatasetDescription",
    inputs={"dataset": vm_train_ds},
)
result.log()

# Class imbalance check
result = vm.tests.run_test(
    "validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_train_ds},
)
result.log()

# Confusion matrix — validates model performance visualization
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ConfusionMatrix",
    inputs={"dataset": vm_test_ds, "model": vm_model},
)
result.log()

# ROC curve
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.ROCCurve",
    inputs={"dataset": vm_test_ds, "model": vm_model},
)
result.log()

# Feature importance
result = vm.tests.run_test(
    "validmind.model_validation.sklearn.FeatureImportance",
    inputs={"dataset": vm_train_ds, "model": vm_model},
)
result.log()

Step 10 — Run the full test suite
Run the complete classifier documentation suite. This single call executes every test in the suite and sends all results to the ValidMind Platform, where they populate the corresponding sections of your model documentation:
test_suite_result = vm.run_test_suite(
    "classifier_full_suite",
    inputs={
        "dataset": vm_test_ds,
        "model": vm_model,
        "train_dataset": vm_train_ds,
        "test_dataset": vm_test_ds,
    },
)
print('Full test suite completed and results sent to ValidMind Platform.')

Step 11 — Verify results on the platform
To see the results of this notebook in the ValidMind Platform:
- Go to the ValidMind Platform (or your ValidMind instance).
- Navigate to Model Inventory and select your model.
- Open the Documentation tab.
- Confirm that the test results from this notebook appear in the relevant sections.
After a successful run, you should see the following results in your model's documentation:
- Dataset Description table
- Class Imbalance chart
- Confusion Matrix
- ROC Curve
- Feature Importance chart
- Full classifier suite results
Troubleshooting
If you run into any of the issues below, the table lists the likely fix:
| Issue | Fix |
|---|---|
| ModuleNotFoundError after install | Re-run the dbutils.library.restartPython() cell. |
| ConnectionError on vm.init() | Your workspace may block outbound traffic. Check your network policy, or use a cluster with internet access. |
| 401 Unauthorized on vm.init() | The API key or secret is incorrect. Copy your credentials again from the ValidMind Platform. |
| numpy version conflict | Pin a compatible version with %pip install -q validmind "numpy>=1.23,<2.0.0". |
| 404 on dataset load | No Databricks table binding was found. Create one in Settings → Integrations → Databricks, then wait for the initial sync to complete. |
| row_data is empty after binding created | The initial sync is still running. Wait about 30 seconds and re-run Step 4. |
| Wrong columns or target not found | Update TARGET_COLUMN in Step 4 to match the target column in your Unity Catalog table. |
| Want to try the notebook without a binding | Set USE_SYNTHETIC_FALLBACK = True in Step 4 to use generated data. |
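For the "row_data is empty" case, waiting and re-running by hand can be replaced with a small polling loop. This is only a sketch under stated assumptions: wait_for_rows and its fetch_rows argument are hypothetical and not part of the ValidMind Library. You would supply fetch_rows yourself, for example as a wrapper around the request made in Step 4 that returns the row_data list.

```python
import time


def wait_for_rows(fetch_rows, attempts=5, delay_seconds=10.0, sleep=time.sleep):
    """Call fetch_rows() until it returns a non-empty list or attempts run out.

    fetch_rows is a hypothetical zero-argument callable supplied by you.
    sleep is injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(1, attempts + 1):
        rows = fetch_rows()
        if rows:
            return rows
        # Pause between attempts, but not after the final one.
        if attempt < attempts:
            sleep(delay_seconds)
    raise TimeoutError(f"No rows after {attempts} attempts.")
```

The defaults give up after roughly 40 seconds of waiting, which is in line with the 30-second estimate in the table; adjust them to suit the size of your table.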
Copyright © 2023-2026 ValidMind Inc. All rights reserved.
Refer to LICENSE for details.
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial