diff --git a/notebooks/code_sharing/operational_deposit/operational_deposit_poc.ipynb b/notebooks/code_sharing/operational_deposit/operational_deposit_poc.ipynb index 23144cd2f..8d0b912f2 100644 --- a/notebooks/code_sharing/operational_deposit/operational_deposit_poc.ipynb +++ b/notebooks/code_sharing/operational_deposit/operational_deposit_poc.ipynb @@ -1132,6 +1132,7 @@ }, { "cell_type": "markdown", + "id": "copyright-e2f10038a74449cba590e46511b2368c", "metadata": {}, "source": [ "\n", diff --git a/notebooks/how_to/data_and_datasets/use_dataset_model_objects.ipynb b/notebooks/how_to/data_and_datasets/use_dataset_model_objects.ipynb index cc9975c8c..f7a2eb55f 100644 --- a/notebooks/how_to/data_and_datasets/use_dataset_model_objects.ipynb +++ b/notebooks/how_to/data_and_datasets/use_dataset_model_objects.ipynb @@ -1,983 +1,984 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction to ValidMind Dataset and Model Objects\n", - "\n", - "When writing custom tests, it is essential to be aware of the interfaces of the ValidMind Dataset and ValidMind Model, which are used as input arguments.\n", - "\n", - "As a model developer, writing custom tests is beneficial when the ValidMind library lacks a built-in test for your specific needs. For example, a model might require new tests to evaluate specific aspects of the model or dataset based on a particular use case.\n", - "\n", - "This interactive notebook offers a detailed understanding of ValidMind objects and their use in writing custom tests. It introduces various interfaces provided by these objects and demonstrates how they can be leveraged to implement tests effortlessly." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.content-hidden when-format=\"html\"}\n", - "## Contents \n", - "- [About ValidMind](#toc1__) \n", - " - [Before you begin](#toc1_1__) \n", - " - [New to ValidMind?](#toc1_2__) \n", - " - [Key concepts](#toc1_3__) \n", - "- [Setting up](#toc2__) \n", - " - [Install the ValidMind Library](#toc2_1__) \n", - " - [Initialize the ValidMind Library](#toc2_2__) \n", - " - [Register sample model](#toc2_2_1__) \n", - " - [Apply documentation template](#toc2_2_2__) \n", - " - [Get your code snippet](#toc2_2_3__) \n", - "- [Load the demo dataset](#toc3__) \n", - " - [Prepocess the raw dataset](#toc3_1__) \n", - "- [Train a model for testing](#toc4__) \n", - "- [Explore basic components of the ValidMind library](#toc5__) \n", - " - [VMDataset Object](#toc5_1__) \n", - " - [Initialize the ValidMind datasets](#toc5_1_1__) \n", - " - [ Interfaces of the dataset object](#toc5_1_2__) \n", - " - [Using VM Dataset object as arguments in custom tests](#toc5_2__) \n", - " - [Run the test](#toc5_2_1__) \n", - " - [Using VM Dataset object and parameters as arguments in custom tests](#toc5_3__) \n", - " - [VMModel Object](#toc5_4__) \n", - " - [Initialize ValidMind model object](#toc5_5__) \n", - " - [Assign predictions to the datasets](#toc5_6__) \n", - " - [Using VM Model and Dataset objects as arguments in Custom tests](#toc5_7__) \n", - " - [Log the test results](#toc5_8__) \n", - "- [Where to go from here](#toc6__) \n", - " - [Discover more learning resources](#toc6_1__) \n", - "- [Upgrade ValidMind](#toc7__) \n", - "\n", - ":::\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## About ValidMind\n", - "\n", - "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical 
models. You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.\n", - "\n", - "\n", - "\n", - "### Before you begin\n", - "\n", - "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.\n", - "\n", - "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).\n", - "\n", - "\n", - "\n", - "### New to ValidMind?\n", - "\n", - "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", - "\n", - "
For access to all features available in this notebook, you'll need a ValidMind account.\n", - "
\n", - "Register with ValidMind
\n", - "\n", - "\n", - "\n", - "### Key concepts\n", - "\n", - "Here, we will focus on ValidMind dataset, ValidMind model and tests to use these objects to generate artefacts for the documentation.\n", - "\n", - "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", - "\n", - "**Custom tests**: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.\n", - "\n", - "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", - "\n", - "- **model**: A single ValidMind model object that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", - "- **dataset**: Single ValidMind dataset object that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", - "- **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.\n", - "- **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this [example](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html) for more information.\n", - "\n", - "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.\n", - "\n", - "**Outputs**: Tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.\n", - "\n", - "**Dataset based Test**\n", - "\n", - "![Dataset based test architecture](./dataset_image.png)\n", - "The dataset based tests take VM dataset object(s) as inputs, test configuration as test parameters to produce `Outputs` as mentioned above.\n", - "\n", - "**Model based Test**\n", - "\n", - "![Model based test architecture](./model_image.png)\n", - "Similar to datasest based tests, the model based tests as an additional input that is VM model object. It allows to identify prediction values of a specific model in the dataset object. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Setting up" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Install the ValidMind Library\n", - "\n", - "Please note the following recommended Python versions to use:\n", - "\n", - "- Python 3.7 > x <= 3.11\n", - "\n", - "To install the library:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q validmind" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Initialize the ValidMind Library" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Register sample model\n", - "\n", - "Let's first register a sample model for use with this notebook:\n", - "\n", - "1. 
In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n", - "\n", - "2. In the left sidebar, navigate to **Inventory** and click **+ Register Model**.\n", - "\n", - "3. Enter the model details and click **Next >** to continue to assignment of model stakeholders. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n", - "\n", - "4. Select your own name under the **MODEL OWNER** drop-down.\n", - "\n", - "5. Click **Register Model** to add the model to your inventory." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Apply documentation template\n", - "\n", - "Once you've registered your model, let's select a documentation template. A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.\n", - "\n", - "1. In the left sidebar that appears for your model, click **Documents** and select **Documentation**.\n", - "\n", - "2. Under **TEMPLATE**, select `Binary classification`.\n", - "\n", - "3. Click **Use Template** to apply the template." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Get your code snippet\n", - "\n", - "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.\n", - "\n", - "1. In the left sidebar that appears for your model, select **Getting Started** and click **Copy snippet to clipboard**.\n", - "2. Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "metadata": {} - }, - "outputs": [], - "source": [ - "# Load your model identifier credentials from an `.env` file\n", - "\n", - "%load_ext dotenv\n", - "%dotenv .env\n", - "\n", - "# Or replace with your code snippet\n", - "\n", - "import validmind as vm\n", - "\n", - "vm.init(\n", - "    # api_host=\"...\",\n", - "    # api_key=\"...\",\n", - "    # api_secret=\"...\",\n", - "    # model=\"...\",\n", - ")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%matplotlib inline\n", - "\n", - "import xgboost as xgb" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Load the demo dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.datasets.classification import customer_churn as demo_dataset\n", - "\n", - "raw_df = demo_dataset.load_data()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Preprocess the raw dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "train_df, validation_df, test_df = demo_dataset.preprocess(raw_df)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Train a model for testing\n", - "\n", - "We train a simple customer churn model for our test."
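, - "\n", - "Before fitting the model, it can help to sanity-check the preprocessed splits. This is a minimal sketch that only assumes the `train_df`, `validation_df`, and `test_df` splits created above:\n", - "\n", - "```python\n", - "# Optional sanity check: size and churn rate of each split\n", - "for name, split_df in [(\"train\", train_df), (\"validation\", validation_df), (\"test\", test_df)]:\n", - "    print(name, split_df.shape, split_df[demo_dataset.target_column].mean())\n", - "```"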
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "x_train = train_df.drop(demo_dataset.target_column, axis=1)\n", - "y_train = train_df[demo_dataset.target_column]\n", - "x_val = validation_df.drop(demo_dataset.target_column, axis=1)\n", - "y_val = validation_df[demo_dataset.target_column]\n", - "\n", - "model = xgb.XGBClassifier(early_stopping_rounds=10)\n", - "model.set_params(\n", - " eval_metric=[\"error\", \"logloss\", \"auc\"],\n", - ")\n", - "model.fit(\n", - " x_train,\n", - " y_train,\n", - " eval_set=[(x_val, y_val)],\n", - " verbose=False,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Explore basic components of the ValidMind library\n", - "\n", - "In this section, you will learn about the basic objects of the ValidMind library that are necessary to implement both custom and built-in tests. As explained above, these objects are:\n", - "* VMDataset: [The high level APIs can be found here](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMDataset)\n", - "* VMModel: [The high level APIs can be found here](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMModel)\n", - "\n", - "Let's understand these objects and their interfaces step by step: " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### VMDataset Object" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Initialize the ValidMind datasets\n", - "\n", - "You can initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module.\n", - "\n", - "The function wraps the dataset to create a ValidMind `Dataset` object so that you can write tests effectively using the common interface provided by the VM objects. This step is always necessary every time you want to connect a dataset to documentation and produce test results through ValidMind. You only need to do it one time per dataset.\n", - "\n", - "This function takes a number of arguments. Some of the arguments are:\n", - "\n", - "- `dataset` — the raw dataset that you want to provide as input to tests\n", - "- `input_id` - a unique identifier that allows tracking what inputs are used when running each individual test\n", - "- `target_column` — a required argument if tests require access to true values. This is the name of the target column in the dataset\n", - "\n", - "The detailed list of the arguments can be found [here](https://docs.validmind.ai/validmind/validmind.html#init_dataset) " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# vm_raw_dataset is now a VMDataset object that you can pass to any ValidMind test\n", - "vm_raw_dataset = vm.init_dataset(\n", - " dataset=raw_df,\n", - " input_id=\"raw_dataset\",\n", - " target_column=\"Exited\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Once you have a ValidMind dataset object (VMDataset), you can inspect its attributes and methods using the inspect_obj utility module. This module provides a list of available attributes and interfaces for use in tests. Understanding how to use VMDatasets is crucial for comprehending how a custom test functions." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.utils import inspect_obj\n", - "inspect_obj(vm_raw_dataset)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Interfaces of the dataset object\n", - "\n", - "**DataFrame**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_raw_dataset.df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Feature columns**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_raw_dataset.feature_columns" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Target column**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_raw_dataset.target_column" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Features values**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_raw_dataset.x_df()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Target value**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_raw_dataset.y_df()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Numeric feature columns** " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_raw_dataset.feature_columns_numeric" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "**Categorical feature columns** " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_raw_dataset.feature_columns_categorical" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Similarly, you can use all other interfaces of the [VMDataset objects](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMDataset) " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Using VM Dataset object as arguments in custom tests\n", - "\n", - "A custom test is simply a Python function that takes two types of arguments: `inputs` and `params`. The `inputs` are ValidMind objects (`VMDataset`, `VMModel`), and the `params` are additional parameters required for the underlying computation of the test. We will discuss both types of arguments in the following sections.\n", - "\n", - "Let's start with a custom test that requires only a ValidMind dataset object. In this example, we will check the balance of classes in the target column of the dataset:\n", - "\n", - "- The custom test below requires a single argument of type `VMDataset` (dataset).\n", - "- The `my_custom_tests.ClassImbalance` is a unique test identifier that can be assigned using the `vm.test` decorator functionality. 
This unique test ID will be used in the platform to load test results into the documentation.\n", - "- The `dataset.target_column` and `dataset.df` attributes of the `VMDataset` object are used in the test.\n", - "\n", - "Other high-level APIs (attributes and methods) of the dataset object are listed [here](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMDataset).\n", - "\n", - "If you've gone through the [Implement custom tests notebook](../tests/custom_tests/implement_custom_tests.ipynb), you should have a good understanding of how custom tests are implemented in detail. If you haven't, we recommend going through that notebook first." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.vm_models.dataset.dataset import VMDataset\n", - "import pandas as pd\n", - "\n", - "@vm.test(\"my_custom_tests.ClassImbalance\")\n", - "def class_imbalance(dataset):\n", - "    # Can only run this test if we have a Dataset object\n", - "    if not isinstance(dataset, VMDataset):\n", - "        raise ValueError(\"ClassImbalance requires a validmind Dataset object\")\n", - "\n", - "    if dataset.target_column is None:\n", - "        print(\"Skipping class_imbalance test because no target column is defined\")\n", - "        return\n", - "\n", - "    # VMDataset object provides the target_column attribute\n", - "    target_column = dataset.target_column\n", - "    # we can access the pandas DataFrame using the df attribute\n", - "    imbalance_percentages = dataset.df[target_column].value_counts(\n", - "        normalize=True\n", - "    )\n", - "    classes = list(imbalance_percentages.index)\n", - "    percentages = list(imbalance_percentages.values * 100)\n", - "\n", - "    return pd.DataFrame({\"Classes\": classes, \"Percentage\": percentages})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Run the test\n", - "\n", - "Let's run the test using the `run_test` method, which is part of the `validmind.tests` module. Here, we pass the `dataset` through the `inputs`. Similarly, you can pass `datasets`, `model`, or `models` as inputs if your custom test requires them. In the example below, we run the custom test `my_custom_tests.ClassImbalance` by passing the `dataset` through the `inputs`. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.tests import run_test\n", - "\n", - "result = run_test(\n", - "    test_id=\"my_custom_tests.ClassImbalance\",\n", - "    inputs={\n", - "        \"dataset\": vm_raw_dataset\n", - "    }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can move custom tests into separate modules in a folder. This allows you to take one-off tests and move them into an organized structure that makes it easier to manage, maintain, and share them. We have provided a separate notebook with a detailed explanation [here](../tests/custom_tests/integrate_external_test_providers.ipynb). " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Using VM Dataset object and parameters as arguments in custom tests\n", - "\n", - "Similar to `inputs`, you can pass `params` to a custom test by providing a dictionary of parameters to the `run_test()` function. The parameters will override any default parameters set in the custom test definition. Note that the `dataset` is still passed as `inputs`. \n", - "Let's modify the class imbalance test so that it provides flexibility to `normalize` the results."
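, - "\n", - "You could also request raw counts instead of percentages - a minimal sketch, assuming the parameterized test in the next cell has been registered first:\n", - "\n", - "```python\n", - "from validmind.tests import run_test\n", - "\n", - "# Sketch: override the default normalize=True to get raw class counts\n", - "result = run_test(\n", - "    test_id=\"my_custom_tests.ClassImbalance\",\n", - "    inputs={\"dataset\": vm_raw_dataset},\n", - "    params={\"normalize\": False},\n", - ")\n", - "```"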
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.vm_models.dataset.dataset import VMDataset\n", - "import pandas as pd\n", - "\n", - "@vm.test(\"my_custom_tests.ClassImbalance\")\n", - "def class_imbalance(dataset, normalize=True):\n", - "    # Can only run this test if we have a Dataset object\n", - "    if not isinstance(dataset, VMDataset):\n", - "        raise ValueError(\"ClassImbalance requires a validmind Dataset object\")\n", - "\n", - "    if dataset.target_column is None:\n", - "        print(\"Skipping class_imbalance test because no target column is defined\")\n", - "        return\n", - "\n", - "    # VMDataset object provides the target_column attribute\n", - "    target_column = dataset.target_column\n", - "    # we can access the pandas DataFrame using the df attribute\n", - "    imbalance_percentages = dataset.df[target_column].value_counts(\n", - "        normalize=normalize\n", - "    )\n", - "    classes = list(imbalance_percentages.index)\n", - "    if normalize:\n", - "        result = pd.DataFrame({\"Classes\": classes, \"Percentage\": list(imbalance_percentages.values * 100)})\n", - "    else:\n", - "        result = pd.DataFrame({\"Classes\": classes, \"Count\": list(imbalance_percentages.values)})\n", - "    return result" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In the example below, the `normalize` parameter is set to `True`, so the class distribution is returned as percentages. You can change the value to `False` if you want raw class counts instead. The results of the test will reflect this flexibility, allowing for different outputs based on the parameter passed.\n", - "\n", - "Here, we have passed the `dataset` through the `inputs` and the `normalize` parameter using the `params`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.tests import run_test\n", - "\n", - "result = run_test(\n", - "    test_id=\"my_custom_tests.ClassImbalance\",\n", - "    inputs={\"dataset\": vm_raw_dataset},\n", - "    params={\"normalize\": True},\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### VMModel Object" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Initialize ValidMind model object\n", - "\n", - "Similar to the ValidMind `Dataset` object, you can initialize a ValidMind Model object using the [`init_model`](https://docs.validmind.ai/validmind/validmind.html#init_model) function from the ValidMind (`vm`) module.\n", - "\n", - "This function takes a number of arguments. 
Some of the arguments are:\n", - "\n", - "- `model` — the raw model that you want to evaluate\n", - "- `input_id` - a unique identifier that allows tracking what inputs are used when running each individual test\n", - "\n", - "The detailed list of the arguments can be found [here](https://docs.validmind.ai/validmind/validmind.html#init_model) " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_model = vm.init_model(\n", - "    model=model,\n", - "    input_id=\"xgb_model\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's inspect the methods and attributes of the model now:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "inspect_obj(vm_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Assign predictions to the datasets\n", - "\n", - "We can now use the `assign_predictions()` method from the `Dataset` object to link existing predictions to any model. If no prediction values are passed, the method will compute predictions automatically:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm_train_ds = vm.init_dataset(\n", - "    input_id=\"train_dataset\",\n", - "    dataset=train_df,\n", - "    type=\"generic\",\n", - "    target_column=demo_dataset.target_column,\n", - ")\n", - "\n", - "vm_train_ds.assign_predictions(model=vm_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can see below that the extra prediction column (`xgb_model_prediction`) for the model (`xgb_model`) has been added to the dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(vm_train_ds)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Using VM Model and Dataset objects as arguments in custom tests\n", - "\n", - "We will now create a `@vm.test` wrapper that will allow you to create a reusable test. Note the following changes in the code below:\n", - "\n", - "- The function `confusion_matrix` takes two arguments, `dataset` and `model`. These are a `VMDataset` and a `VMModel` object, respectively.\n", - "    - `VMDataset` objects allow you to access the dataset's true (target) values by accessing the `.y` attribute.\n", - "    - `VMDataset` objects allow you to access the predictions for a given model by accessing the `.y_pred()` method.\n", - "- The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.\n", - "- The function body calculates the confusion matrix using the `sklearn.metrics.confusion_matrix` function.\n", - "- The function then returns the `ConfusionMatrixDisplay.figure_` object - this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.\n", - "- The `@vm.test` decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID `my_custom_tests.ConfusionMatrix` (see the [Implement custom tests notebook](../tests/custom_tests/implement_custom_tests.ipynb) for how test IDs work in ValidMind and why this format is important).\n", - "\n", - "Similarly, you can use the functionality provided by `VMDataset` and `VMModel` objects. You can refer to our documentation page for all the available APIs [here](https://docs.validmind.ai/validmind/validmind.html#init_dataset)."
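, - "\n", - "As a quick illustration of the two accessors described above - a minimal sketch, assuming the `vm_train_ds` and `vm_model` objects initialized earlier:\n", - "\n", - "```python\n", - "# Sketch: the accessors the confusion matrix test relies on\n", - "y_true = vm_train_ds.y  # true labels from the dataset\n", - "y_pred = vm_train_ds.y_pred(model=vm_model)  # predictions linked to xgb_model\n", - "assert len(y_true) == len(y_pred)\n", - "```"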
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from sklearn import metrics\n", - "import matplotlib.pyplot as plt\n", - "\n", - "@vm.test(\"my_custom_tests.ConfusionMatrix\")\n", - "def confusion_matrix(dataset, model):\n", - "    \"\"\"The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.\n", - "\n", - "    The confusion matrix is a 2x2 table that contains 4 values:\n", - "\n", - "    - True Positive (TP): the number of correct positive predictions\n", - "    - True Negative (TN): the number of correct negative predictions\n", - "    - False Positive (FP): the number of incorrect positive predictions\n", - "    - False Negative (FN): the number of incorrect negative predictions\n", - "\n", - "    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.\n", - "    \"\"\"\n", - "    # we can retrieve the target values from the dataset using the y attribute\n", - "    y_true = dataset.y\n", - "    # and the prediction values of a specific model using the y_pred method\n", - "    y_pred = dataset.y_pred(model=model)\n", - "\n", - "    confusion_matrix = metrics.confusion_matrix(y_true, y_pred)\n", - "\n", - "    cm_display = metrics.ConfusionMatrixDisplay(\n", - "        confusion_matrix=confusion_matrix, display_labels=[False, True]\n", - "    )\n", - "    cm_display.plot()\n", - "    plt.close()\n", - "\n", - "    return cm_display.figure_  # return the figure object itself" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here, we run the test using two inputs: `dataset` and `model`. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.tests import run_test\n", - "\n", - "result = run_test(\n", - "    test_id=\"my_custom_tests.ConfusionMatrix\",\n", - "    inputs={\n", - "        \"dataset\": vm_train_ds,\n", - "        \"model\": vm_model,\n", - "    }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Log the test results\n", - "\n", - "You can log any test result to the ValidMind Platform with the `.log()` method of the result object. This will allow you to add the result to the documentation.\n", - "\n", - "You can now log the confusion matrix results."
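, - "\n", - "If you prefer, the run and log steps can also be chained in a single call - a minimal sketch reusing the same inputs as above:\n", - "\n", - "```python\n", - "# Sketch: run the test and log the result in one step\n", - "run_test(\n", - "    test_id=\"my_custom_tests.ConfusionMatrix\",\n", - "    inputs={\"dataset\": vm_train_ds, \"model\": vm_model},\n", - ").log()\n", - "```"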
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "result.log()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Where to go from here\n", - "\n", - "In this notebook you have learned how the ValidMind Dataset and Model objects work and how to use them when writing tests, running through some very common scenarios in a typical model development setting:\n", - "\n", - "- Initializing ValidMind dataset and model objects and exploring their interfaces\n", - "- Implementing custom tests that take these objects as inputs, together with test parameters\n", - "- Running custom tests and logging the results to the ValidMind Platform\n", - "\n", - "\n", - "\n", - "### Discover more learning resources\n", - "\n", - "We offer many interactive notebooks to help you automate testing, documenting, validating, and more:\n", - "\n", - "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", - "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", - "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", - "\n", - "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Upgrade ValidMind\n", - "\n", - "
After installing ValidMind, you’ll want to periodically make sure you are on the latest version to access any new features and other enhancements.
\n", - "\n", - "Retrieve the information for the currently installed version of ValidMind:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip show validmind" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", - "\n", - "```bash\n", - "%pip install --upgrade validmind\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You may need to restart your kernel after running the upgrade package for changes to be applied." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "\n", - "\n", - "***\n", - "\n", - "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", - "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", - "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.14" - } - }, - "nbformat": 4, - "nbformat_minor": 2 + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction to ValidMind Dataset and Model Objects\n", + "\n", + "When writing custom tests, it is essential to be aware of the interfaces of the ValidMind Dataset and ValidMind Model, which are used as input arguments.\n", + "\n", + "As a model developer, writing custom tests is beneficial when the ValidMind library lacks a built-in test for your specific needs. For example, a model might require new tests to evaluate specific aspects of the model or dataset based on a particular use case.\n", + "\n", + "This interactive notebook offers a detailed understanding of ValidMind objects and their use in writing custom tests. It introduces various interfaces provided by these objects and demonstrates how they can be leveraged to implement tests effortlessly." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.content-hidden when-format=\"html\"}\n", + "## Contents \n", + "- [About ValidMind](#toc1__) \n", + " - [Before you begin](#toc1_1__) \n", + " - [New to ValidMind?](#toc1_2__) \n", + " - [Key concepts](#toc1_3__) \n", + "- [Setting up](#toc2__) \n", + " - [Install the ValidMind Library](#toc2_1__) \n", + " - [Initialize the ValidMind Library](#toc2_2__) \n", + " - [Register sample model](#toc2_2_1__) \n", + " - [Apply documentation template](#toc2_2_2__) \n", + " - [Get your code snippet](#toc2_2_3__) \n", + "- [Load the demo dataset](#toc3__) \n", + " - [Prepocess the raw dataset](#toc3_1__) \n", + "- [Train a model for testing](#toc4__) \n", + "- [Explore basic components of the ValidMind library](#toc5__) \n", + " - [VMDataset Object](#toc5_1__) \n", + " - [Initialize the ValidMind datasets](#toc5_1_1__) \n", + " - [ Interfaces of the dataset object](#toc5_1_2__) \n", + " - [Using VM Dataset object as arguments in custom tests](#toc5_2__) \n", + " - [Run the test](#toc5_2_1__) \n", + " - [Using VM Dataset object and parameters as arguments in custom tests](#toc5_3__) \n", + " - [VMModel Object](#toc5_4__) \n", + " - [Initialize ValidMind model object](#toc5_5__) \n", + " - [Assign predictions to the datasets](#toc5_6__) \n", + " - [Using VM Model and Dataset objects as arguments in Custom tests](#toc5_7__) \n", + " - [Log the test results](#toc5_8__) \n", + "- [Where to go from here](#toc6__) \n", + " - [Discover more learning resources](#toc6_1__) \n", + "- [Upgrade ValidMind](#toc7__) \n", + "\n", + ":::\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## About ValidMind\n", + "\n", + "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models. You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. 
Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.\n", + "\n", + "\n", + "\n", + "### Before you begin\n", + "\n", + "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.\n", + "\n", + "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).\n", + "\n", + "\n", + "\n", + "### New to ValidMind?\n", + "\n", + "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", + "\n", + "
For access to all features available in this notebook, you'll need a ValidMind account.\n", + "
\n", + "Register with ValidMind
\n", + "\n", + "\n", + "\n", + "### Key concepts\n", + "\n", + "Here, we will focus on ValidMind dataset, ValidMind model and tests to use these objects to generate artefacts for the documentation.\n", + "\n", + "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", + "\n", + "**Custom tests**: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.\n", + "\n", + "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", + "\n", + "- **model**: A single ValidMind model object that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", + "- **dataset**: Single ValidMind dataset object that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", + "- **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.\n", + "- **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this [example](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html) for more information.\n", + "\n", + "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.\n", + "\n", + "**Outputs**: Tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.\n", + "\n", + "**Dataset based Test**\n", + "\n", + "![Dataset based test architecture](./dataset_image.png)\n", + "The dataset based tests take VM dataset object(s) as inputs, test configuration as test parameters to produce `Outputs` as mentioned above.\n", + "\n", + "**Model based Test**\n", + "\n", + "![Model based test architecture](./model_image.png)\n", + "Similar to datasest based tests, the model based tests as an additional input that is VM model object. It allows to identify prediction values of a specific model in the dataset object. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Setting up" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Install the ValidMind Library\n", + "\n", + "Please note the following recommended Python versions to use:\n", + "\n", + "- Python 3.7 > x <= 3.11\n", + "\n", + "To install the library:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q validmind" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Initialize the ValidMind Library" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Register sample model\n", + "\n", + "Let's first register a sample model for use with this notebook:\n", + "\n", + "1. 
In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n", + "\n", + "2. In the left sidebar, navigate to **Inventory** and click **+ Register Model**.\n", + "\n", + "3. Enter the model details and click **Next >** to continue to assignment of model stakeholders. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n", + "\n", + "4. Select your own name under the **MODEL OWNER** drop-down.\n", + "\n", + "5. Click **Register Model** to add the model to your inventory." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Apply documentation template\n", + "\n", + "Once you've registered your model, let's select a documentation template. A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.\n", + "\n", + "1. In the left sidebar that appears for your model, click **Documents** and select **Documentation**.\n", + "\n", + "2. Under **TEMPLATE**, select `Binary classification`.\n", + "\n", + "3. Click **Use Template** to apply the template." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Get your code snippet\n", + "\n", + "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.\n", + "\n", + "1. In the left sidebar that appears for your model, select **Getting Started** and click **Copy snippet to clipboard**.\n", + "2. Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "metadata": {} + }, + "outputs": [], + "source": [ + "# Load your model identifier credentials from an `.env` file\n", + "\n", + "%load_ext dotenv\n", + "%dotenv .env\n", + "\n", + "# Or replace with your code snippet\n", + "\n", + "import validmind as vm\n", + "\n", + "vm.init(\n", + "    # api_host=\"...\",\n", + "    # api_key=\"...\",\n", + "    # api_secret=\"...\",\n", + "    # model=\"...\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "\n", + "import xgboost as xgb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Load the demo dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from validmind.datasets.classification import customer_churn as demo_dataset\n", + "\n", + "raw_df = demo_dataset.load_data()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Preprocess the raw dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_df, validation_df, test_df = demo_dataset.preprocess(raw_df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Train a model for testing\n", + "\n", + "We train a simple customer churn model for our test."
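, + "\n", + "Before fitting the model, it can help to sanity-check the preprocessed splits. This is a minimal sketch that only assumes the `train_df`, `validation_df`, and `test_df` splits created above:\n", + "\n", + "```python\n", + "# Optional sanity check: size and churn rate of each split\n", + "for name, split_df in [(\"train\", train_df), (\"validation\", validation_df), (\"test\", test_df)]:\n", + "    print(name, split_df.shape, split_df[demo_dataset.target_column].mean())\n", + "```"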
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "x_train = train_df.drop(demo_dataset.target_column, axis=1)\n", + "y_train = train_df[demo_dataset.target_column]\n", + "x_val = validation_df.drop(demo_dataset.target_column, axis=1)\n", + "y_val = validation_df[demo_dataset.target_column]\n", + "\n", + "model = xgb.XGBClassifier(early_stopping_rounds=10)\n", + "model.set_params(\n", + " eval_metric=[\"error\", \"logloss\", \"auc\"],\n", + ")\n", + "model.fit(\n", + " x_train,\n", + " y_train,\n", + " eval_set=[(x_val, y_val)],\n", + " verbose=False,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Explore basic components of the ValidMind library\n", + "\n", + "In this section, you will learn about the basic objects of the ValidMind library that are necessary to implement both custom and built-in tests. As explained above, these objects are:\n", + "* VMDataset: [The high level APIs can be found here](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMDataset)\n", + "* VMModel: [The high level APIs can be found here](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMModel)\n", + "\n", + "Let's understand these objects and their interfaces step by step: " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### VMDataset Object" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Initialize the ValidMind datasets\n", + "\n", + "You can initialize a ValidMind dataset object using the [`init_dataset`](https://docs.validmind.ai/validmind/validmind.html#init_dataset) function from the ValidMind (`vm`) module.\n", + "\n", + "The function wraps the dataset to create a ValidMind `Dataset` object so that you can write tests effectively using the common interface provided by the VM objects. This step is always necessary every time you want to connect a dataset to documentation and produce test results through ValidMind. You only need to do it one time per dataset.\n", + "\n", + "This function takes a number of arguments. Some of the arguments are:\n", + "\n", + "- `dataset` — the raw dataset that you want to provide as input to tests\n", + "- `input_id` - a unique identifier that allows tracking what inputs are used when running each individual test\n", + "- `target_column` — a required argument if tests require access to true values. This is the name of the target column in the dataset\n", + "\n", + "The detailed list of the arguments can be found [here](https://docs.validmind.ai/validmind/validmind.html#init_dataset) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# vm_raw_dataset is now a VMDataset object that you can pass to any ValidMind test\n", + "vm_raw_dataset = vm.init_dataset(\n", + " dataset=raw_df,\n", + " input_id=\"raw_dataset\",\n", + " target_column=\"Exited\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once you have a ValidMind dataset object (VMDataset), you can inspect its attributes and methods using the inspect_obj utility module. This module provides a list of available attributes and interfaces for use in tests. Understanding how to use VMDatasets is crucial for comprehending how a custom test functions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from validmind.utils import inspect_obj\n", + "inspect_obj(vm_raw_dataset)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Interfaces of the dataset object\n", + "\n", + "**DataFrame**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_raw_dataset.df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Feature columns**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_raw_dataset.feature_columns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Target column**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_raw_dataset.target_column" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Features values**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_raw_dataset.x_df()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Target value**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_raw_dataset.y_df()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Numeric feature columns** " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_raw_dataset.feature_columns_numeric" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Categorical feature columns** " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm_raw_dataset.feature_columns_categorical" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similarly, you can use all other interfaces of the [VMDataset objects](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMDataset) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Using VM Dataset object as arguments in custom tests\n", + "\n", + "A custom test is simply a Python function that takes two types of arguments: `inputs` and `params`. The `inputs` are ValidMind objects (`VMDataset`, `VMModel`), and the `params` are additional parameters required for the underlying computation of the test. We will discuss both types of arguments in the following sections.\n", + "\n", + "Let's start with a custom test that requires only a ValidMind dataset object. In this example, we will check the balance of classes in the target column of the dataset:\n", + "\n", + "- The custom test below requires a single argument of type `VMDataset` (dataset).\n", + "- The `my_custom_tests.ClassImbalance` is a unique test identifier that can be assigned using the `vm.test` decorator functionality. 
This unique test ID will be used in the platform to load test results into the documentation.\n", + "- The `dataset.target_column` and `dataset.df` attributes of the `VMDataset` object are used in the test.\n", + "\n", + "Other high-level APIs (attributes and methods) of the dataset object are listed [here](https://docs.validmind.ai/validmind/validmind/vm_models.html#VMDataset).\n", + "\n", + "If you've gone through the [Implement custom tests notebook](../tests/custom_tests/implement_custom_tests.ipynb), you should have a good understanding of how custom tests are implemented in detail. If you haven't, we recommend going through that notebook first." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from validmind.vm_models.dataset.dataset import VMDataset\n", + "import pandas as pd\n", + "\n", + "@vm.test(\"my_custom_tests.ClassImbalance\")\n", + "def class_imbalance(dataset):\n", + "    # Can only run this test if we have a Dataset object\n", + "    if not isinstance(dataset, VMDataset):\n", + "        raise ValueError(\"ClassImbalance requires a validmind Dataset object\")\n", + "\n", + "    if dataset.target_column is None:\n", + "        print(\"Skipping class_imbalance test because no target column is defined\")\n", + "        return\n", + "\n", + "    # VMDataset object provides the target_column attribute\n", + "    target_column = dataset.target_column\n", + "    # we can access the pandas DataFrame using the df attribute\n", + "    imbalance_percentages = dataset.df[target_column].value_counts(\n", + "        normalize=True\n", + "    )\n", + "    classes = list(imbalance_percentages.index)\n", + "    percentages = list(imbalance_percentages.values * 100)\n", + "\n", + "    return pd.DataFrame({\"Classes\": classes, \"Percentage\": percentages})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Run the test\n", + "\n", + "Let's run the test using the `run_test` method, which is part of the `validmind.tests` module. Here, we pass the `dataset` through the `inputs`. Similarly, you can pass `datasets`, `model`, or `models` as inputs if your custom test requires them. In the example below, we run the custom test `my_custom_tests.ClassImbalance` by passing the `dataset` through the `inputs`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from validmind.tests import run_test\n", + "\n", + "result = run_test(\n", + "    test_id=\"my_custom_tests.ClassImbalance\",\n", + "    inputs={\n", + "        \"dataset\": vm_raw_dataset\n", + "    }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can move custom tests into separate modules in a folder. This allows you to take one-off tests and move them into an organized structure that makes it easier to manage, maintain, and share them. We have provided a separate notebook with a detailed explanation [here](../tests/custom_tests/integrate_external_test_providers.ipynb). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Using VM Dataset object and parameters as arguments in custom tests\n", + "\n", + "Similar to `inputs`, you can pass `params` to a custom test by providing a dictionary of parameters to the `run_test()` function. The parameters will override any default parameters set in the custom test definition. Note that the `dataset` is still passed as `inputs`. \n", + "Let's modify the class imbalance test so that it provides flexibility to `normalize` the results."
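, + "\n", + "You could also request raw counts instead of percentages - a minimal sketch, assuming the parameterized test in the next cell has been registered first:\n", + "\n", + "```python\n", + "from validmind.tests import run_test\n", + "\n", + "# Sketch: override the default normalize=True to get raw class counts\n", + "result = run_test(\n", + "    test_id=\"my_custom_tests.ClassImbalance\",\n", + "    inputs={\"dataset\": vm_raw_dataset},\n", + "    params={\"normalize\": False},\n", + ")\n", + "```"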
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from validmind.vm_models.dataset.dataset import VMDataset\n", + "import pandas as pd\n", + "\n", + "@vm.test(\"my_custom_tests.ClassImbalance\")\n", + "def class_imbalance(dataset, normalize=True):\n", + "    # Can only run this test if we have a Dataset object\n", + "    if not isinstance(dataset, VMDataset):\n", + "        raise ValueError(\"ClassImbalance requires a validmind Dataset object\")\n", + "\n", + "    if dataset.target_column is None:\n", + "        print(\"Skipping class_imbalance test because no target column is defined\")\n", + "        return\n", + "\n", + "    # VMDataset object provides the target_column attribute\n", + "    target_column = dataset.target_column\n", + "    # we can access the pandas DataFrame using the df attribute\n", + "    imbalance_percentages = dataset.df[target_column].value_counts(\n", + "        normalize=normalize\n", + "    )\n", + "    classes = list(imbalance_percentages.index)\n", + "    if normalize:\n", + "        result = pd.DataFrame({\"Classes\": classes, \"Percentage\": list(imbalance_percentages.values * 100)})\n", + "    else:\n", + "        result = pd.DataFrame({\"Classes\": classes, \"Count\": list(imbalance_percentages.values)})\n", + "    return result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the example below, the `normalize` parameter is set to `True`, so the class distribution is returned as percentages. You can change the value to `False` if you want raw class counts instead. The results of the test will reflect this flexibility, allowing for different outputs based on the parameter passed.\n", + "\n", + "Here, we have passed the `dataset` through the `inputs` and the `normalize` parameter using the `params`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from validmind.tests import run_test\n", + "\n", + "result = run_test(\n", + "    test_id=\"my_custom_tests.ClassImbalance\",\n", + "    inputs={\"dataset\": vm_raw_dataset},\n", + "    params={\"normalize\": True},\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### VMModel Object" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Initialize ValidMind model object\n", + "\n", + "Similar to the ValidMind `Dataset` object, you can initialize a ValidMind Model object using the [`init_model`](https://docs.validmind.ai/validmind/validmind.html#init_model) function from the ValidMind (`vm`) module.\n", + "\n", + "This function takes a number of arguments. 
Some of the arguments are:\n",
+ "\n",
+ "- `model` — the raw model that you want to evaluate\n",
+ "- `input_id` — a unique identifier that allows tracking what inputs are used when running each individual test\n",
+ "\n",
+ "The detailed list of arguments can be found [here](https://docs.validmind.ai/validmind/validmind.html#init_model)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_model = vm.init_model(\n",
+ "    model=model,\n",
+ "    input_id=\"xgb_model\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's inspect the methods and attributes of the model now:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "inspect_obj(vm_model)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Assign predictions to the datasets\n",
+ "\n",
+ "We can now use the `assign_predictions()` method from the `Dataset` object to link existing predictions to any model. If no prediction values are passed, the method will compute predictions automatically:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_train_ds = vm.init_dataset(\n",
+ "    input_id=\"train_dataset\",\n",
+ "    dataset=train_df,\n",
+ "    type=\"generic\",\n",
+ "    target_column=demo_dataset.target_column,\n",
+ ")\n",
+ "\n",
+ "vm_train_ds.assign_predictions(model=vm_model)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "You can see below that the extra prediction column (`xgb_model_prediction`) for the model (`xgb_model`) has been added to the dataset."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(vm_train_ds)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Using VM Model and Dataset objects as arguments in custom tests\n",
+ "\n",
+ "We will now create a `@vm.test` wrapper that allows you to create a reusable test. Note the following changes in the code below:\n",
+ "\n",
+ "- The function `confusion_matrix` takes two arguments, `dataset` and `model`. These are a `VMDataset` and a `VMModel` object respectively.\n",
+ "  - `VMDataset` objects allow you to access the dataset's true (target) values through the `.y` attribute.\n",
+ "  - `VMDataset` objects allow you to access the predictions for a given model through the `.y_pred()` method.\n",
+ "- The function docstring provides a description of what the test does. This will be displayed along with the result in this notebook as well as in the ValidMind Platform.\n",
+ "- The function body calculates the confusion matrix using the `sklearn.metrics.confusion_matrix` function.\n",
+ "- The function then returns the `ConfusionMatrixDisplay.figure_` object - this is important as the ValidMind Library expects the output of the custom test to be a plot or a table.\n",
+ "- The `@vm.test` decorator is doing the work of creating a wrapper around the function that will allow it to be run by the ValidMind Library. It also registers the test so it can be found by the ID `my_custom_tests.ConfusionMatrix` (as noted earlier, this unique ID is what the platform uses to load test results).\n",
+ "\n",
+ "Similarly, you can use the functionality provided by `VMDataset` and `VMModel` objects. 
You can refer to our documentation page for all the available APIs [here](https://docs.validmind.ai/validmind/validmind.html#init_dataset)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from sklearn import metrics\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "@vm.test(\"my_custom_tests.ConfusionMatrix\")\n",
+ "def confusion_matrix(dataset, model):\n",
+ "    \"\"\"The confusion matrix is a table that is often used to describe the performance of a classification model on a set of data for which the true values are known.\n",
+ "\n",
+ "    The confusion matrix is a 2x2 table that contains 4 values:\n",
+ "\n",
+ "    - True Positive (TP): the number of correct positive predictions\n",
+ "    - True Negative (TN): the number of correct negative predictions\n",
+ "    - False Positive (FP): the number of incorrect positive predictions\n",
+ "    - False Negative (FN): the number of incorrect negative predictions\n",
+ "\n",
+ "    The confusion matrix can be used to assess the holistic performance of a classification model by showing the accuracy, precision, recall, and F1 score of the model on a single figure.\n",
+ "    \"\"\"\n",
+ "    # retrieve the target values from the dataset via the y attribute\n",
+ "    y_true = dataset.y\n",
+ "    # retrieve the predictions of a specific model via the y_pred() method\n",
+ "    y_pred = dataset.y_pred(model=model)\n",
+ "\n",
+ "    cm = metrics.confusion_matrix(y_true, y_pred)\n",
+ "\n",
+ "    cm_display = metrics.ConfusionMatrixDisplay(\n",
+ "        confusion_matrix=cm, display_labels=[False, True]\n",
+ "    )\n",
+ "    cm_display.plot()\n",
+ "    plt.close()\n",
+ "\n",
+ "    return cm_display.figure_  # return the figure object itself"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Here, we run the test using two inputs: `dataset` and `model`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from validmind.tests import run_test\n",
+ "result = run_test(\n",
+ "    test_id=\"my_custom_tests.ConfusionMatrix\",\n",
+ "    inputs={\n",
+ "        \"dataset\": vm_train_ds,\n",
+ "        \"model\": vm_model,\n",
+ "    }\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Log the test results\n",
+ "\n",
+ "You can log any test result to the ValidMind Platform with the `.log()` method of the result object. This allows you to add the result to your model documentation.\n",
+ "\n",
+ "Let's now log the confusion matrix result.\n",
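+ "\n",
+ "Recent versions of the library also let `.log()` take an optional `section_id` argument to place the result in a specific section of your documentation template. A minimal sketch, assuming your template contains a section with the (hypothetical) ID used here:\n",
+ "\n",
+ "```python\n",
+ "# log the result into a specific documentation section\n",
+ "# (\"model_evaluation\" is a placeholder; use a section ID from your own template)\n",
+ "result.log(section_id=\"model_evaluation\")\n",
+ "```"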
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "result.log()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "## Where to go from here\n",
+ "\n",
+ "In this notebook you have learned how the ValidMind Dataset and Model objects work and how to use them when writing custom tests, running through some very common scenarios in a typical model development setting:\n",
+ "\n",
+ "- Initializing ValidMind datasets and models and inspecting their interfaces\n",
+ "- Assigning model predictions to a dataset\n",
+ "- Writing custom tests that take datasets, models, and parameters as arguments\n",
+ "- Running custom tests and logging the results to your model documentation\n",
+ "\n",
+ "\n",
+ "\n",
+ "### Discover more learning resources\n",
+ "\n",
+ "We offer many interactive notebooks to help you automate testing, documenting, validating, and more:\n",
+ "\n",
+ "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n",
+ "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n",
+ "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n",
+ "\n",
+ "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "## Upgrade ValidMind\n",
+ "\n",
+ "
After installing ValidMind, you’ll want to periodically make sure you are on the latest version to access any new features and other enhancements.
\n", + "\n", + "Retrieve the information for the currently installed version of ValidMind:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip show validmind" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", + "\n", + "```bash\n", + "%pip install --upgrade validmind\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You may need to restart your kernel after running the upgrade package for changes to be applied." + ] + }, + { + "cell_type": "markdown", + "id": "copyright-6841287fee5e4319a84276ef23c34e1a", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "***\n", + "\n", + "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", + "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", + "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.14" + } + }, + "nbformat": 4, + "nbformat_minor": 2 } diff --git a/notebooks/how_to/tests/explore_tests/explore_test_suites.ipynb b/notebooks/how_to/tests/explore_tests/explore_test_suites.ipynb index 2632fbb96..a17bfc362 100644 --- a/notebooks/how_to/tests/explore_tests/explore_test_suites.ipynb +++ b/notebooks/how_to/tests/explore_tests/explore_test_suites.ipynb @@ -1,798 +1,799 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Explore test suites\n", - "\n", - "Explore ValidMind test suites, pre-built collections of related tests used to evaluate specific aspects of your model. Retrieve available test suites and details for tests within a suite to understand their functionality, allowing you to select the appropriate test suites for your use cases." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.content-hidden when-format=\"html\"}\n", - "## Contents \n", - "- [Contents](#toc1__) \n", - "- [About ValidMind](#toc2__) \n", - " - [Before you begin](#toc2_1__) \n", - " - [New to ValidMind?](#toc2_2__) \n", - " - [Key concepts](#toc2_3__) \n", - "- [Install the ValidMind Library](#toc3__) \n", - "- [List available test suites](#toc4__) \n", - "- [View test suite details](#toc5__) \n", - " - [View test details](#toc5_1__) \n", - "- [Next steps](#toc6__) \n", - " - [Discover more learning resources](#toc6_1__) \n", - "- [Upgrade ValidMind](#toc7__) \n", - "\n", - ":::\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.content-hidden when-format=\"html\"}\n", - "\n", - "\n", - "## Contents\n", - "- [About ValidMind](#toc1_) \n", - " - [Before you begin](#toc1_1_) \n", - " - [New to ValidMind?](#toc1_2_) \n", - " - [Key concepts](#toc1_3_) \n", - "- [Install the ValidMind Library](#toc2_) \n", - "- [List available test suites](#toc3_) \n", - "- [View test suite details](#toc4_) \n", - " - [View test details](#toc4_1_) \n", - "- [Next steps](#toc5_) \n", - " - [Discover more learning resources](#toc5_1_)\n", - " \n", - ":::\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## About ValidMind\n", - "\n", - "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.\n", - "\n", - "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.\n", - "\n", - "\n", - "\n", - "### Before you begin\n", - "\n", - "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. \n", - "\n", - "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. 
For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).\n", - "\n", - "\n", - "\n", - "### New to ValidMind?\n", - "\n", - "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", - "\n", - "
For access to all features available in this notebook, you'll need access to a ValidMind account.\n", - "

\n", - "Register with ValidMind
\n", - "\n", - "\n", - "\n", - "### Key concepts\n", - "\n", - "**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.\n", - "\n", - "**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.\n", - "\n", - "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", - "\n", - "**Custom tests**: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.\n", - "\n", - "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", - "\n", - " - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", - " - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", - " - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.\n", - " - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this [example](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html) for more information.\n", - "\n", - "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.\n", - "\n", - "**Outputs**: Custom tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.\n", - "\n", - "**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.\n", - "\n", - "Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Install the ValidMind Library\n", - "\n", - "
Recommended Python versions\n", - "

\n", - "Python 3.8 <= x <= 3.11
\n", - "\n", - "To install the library:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q validmind" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## List available test suites\n", - "After we import the ValidMind Library, we'll call [test_suites.list_suites()](https://docs.validmind.ai/validmind/validmind/test_suites.html#list_suites) to retrieve a structured list of all available test suites, that includes each suite's name, description, and associated tests:" - ] - }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Explore test suites\n", + "\n", + "Explore ValidMind test suites, pre-built collections of related tests used to evaluate specific aspects of your model. Retrieve available test suites and details for tests within a suite to understand their functionality, allowing you to select the appropriate test suites for your use cases." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.content-hidden when-format=\"html\"}\n", + "## Contents \n", + "- [Contents](#toc1__) \n", + "- [About ValidMind](#toc2__) \n", + " - [Before you begin](#toc2_1__) \n", + " - [New to ValidMind?](#toc2_2__) \n", + " - [Key concepts](#toc2_3__) \n", + "- [Install the ValidMind Library](#toc3__) \n", + "- [List available test suites](#toc4__) \n", + "- [View test suite details](#toc5__) \n", + " - [View test details](#toc5_1__) \n", + "- [Next steps](#toc6__) \n", + " - [Discover more learning resources](#toc6_1__) \n", + "- [Upgrade ValidMind](#toc7__) \n", + "\n", + ":::\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.content-hidden when-format=\"html\"}\n", + "\n", + "\n", + "## Contents\n", + "- [About ValidMind](#toc1_) \n", + " - [Before you begin](#toc1_1_) \n", + " - [New to ValidMind?](#toc1_2_) \n", + " - [Key concepts](#toc1_3_) \n", + "- [Install the ValidMind Library](#toc2_) \n", + "- [List available test suites](#toc3_) \n", + "- [View test suite details](#toc4_) \n", + " - [View test details](#toc4_1_) \n", + "- [Next steps](#toc5_) \n", + " - [Discover more learning resources](#toc5_1_)\n", + " \n", + ":::\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## About ValidMind\n", + "\n", + "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.\n", + "\n", + "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.\n", + "\n", + "\n", + "\n", + "### Before you begin\n", + "\n", + "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. \n", + "\n", + "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. 
For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).\n", + "\n", + "\n", + "\n", + "### New to ValidMind?\n", + "\n", + "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", + "\n", + "
To access all features available in this notebook, you'll need a ValidMind account.\n",
+ "

\n", + "Register with ValidMind
\n", + "\n", + "\n", + "\n", + "### Key concepts\n", + "\n", + "**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.\n", + "\n", + "**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.\n", + "\n", + "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", + "\n", + "**Custom tests**: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.\n", + "\n", + "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", + "\n", + " - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", + " - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", + " - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.\n", + " - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this [example](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html) for more information.\n", + "\n", + "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.\n", + "\n", + "**Outputs**: Custom tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.\n", + "\n", + "**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.\n", + "\n", + "Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Install the ValidMind Library\n", + "\n", + "
Recommended Python versions\n", + "

\n", + "Python 3.8 <= x <= 3.11
\n", + "\n", + "To install the library:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q validmind" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## List available test suites\n", + "After we import the ValidMind Library, we'll call [test_suites.list_suites()](https://docs.validmind.ai/validmind/validmind/test_suites.html#list_suites) to retrieve a structured list of all available test suites, that includes each suite's name, description, and associated tests:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
IDNameDescriptionTests
classifier_model_diagnosisClassifierDiagnosisTest suite for sklearn classifier model diagnosis testsvalidmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis
classifier_full_suiteClassifierFullSuiteFull test suite for binary classification models.validmind.data_validation.DatasetDescription, validmind.data_validation.DescriptiveStatistics, validmind.data_validation.PearsonCorrelationMatrix, validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.HighCardinality, validmind.data_validation.HighPearsonCorrelation, validmind.data_validation.MissingValues, validmind.data_validation.Skewness, validmind.data_validation.UniqueRows, validmind.data_validation.TooManyZeroValues, validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance, validmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison, validmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis
classifier_metricsClassifierMetricsTest suite for sklearn classifier metricsvalidmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance
classifier_model_validationClassifierModelValidationTest suite for binary classification models.validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance, validmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison, validmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis
classifier_validationClassifierPerformanceTest suite for sklearn classifier modelsvalidmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison
cluster_full_suiteClusterFullSuiteFull test suite for clustering models.validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.HomogeneityScore, validmind.model_validation.sklearn.CompletenessScore, validmind.model_validation.sklearn.VMeasure, validmind.model_validation.sklearn.AdjustedRandIndex, validmind.model_validation.sklearn.AdjustedMutualInformation, validmind.model_validation.sklearn.FowlkesMallowsScore, validmind.model_validation.sklearn.ClusterPerformanceMetrics, validmind.model_validation.sklearn.ClusterCosineSimilarity, validmind.model_validation.sklearn.SilhouettePlot, validmind.model_validation.ClusterSizeDistribution, validmind.model_validation.sklearn.HyperParametersTuning, validmind.model_validation.sklearn.KMeansClustersOptimization
cluster_metricsClusterMetricsTest suite for sklearn clustering metricsvalidmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.HomogeneityScore, validmind.model_validation.sklearn.CompletenessScore, validmind.model_validation.sklearn.VMeasure, validmind.model_validation.sklearn.AdjustedRandIndex, validmind.model_validation.sklearn.AdjustedMutualInformation, validmind.model_validation.sklearn.FowlkesMallowsScore, validmind.model_validation.sklearn.ClusterPerformanceMetrics, validmind.model_validation.sklearn.ClusterCosineSimilarity, validmind.model_validation.sklearn.SilhouettePlot
cluster_performanceClusterPerformanceTest suite for sklearn cluster performancevalidmind.model_validation.ClusterSizeDistribution
embeddings_full_suiteEmbeddingsFullSuiteFull test suite for embeddings models.validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.embeddings.DescriptiveAnalytics, validmind.model_validation.embeddings.CosineSimilarityDistribution, validmind.model_validation.embeddings.ClusterDistribution, validmind.model_validation.embeddings.EmbeddingsVisualization2D, validmind.model_validation.embeddings.StabilityAnalysisRandomNoise, validmind.model_validation.embeddings.StabilityAnalysisSynonyms, validmind.model_validation.embeddings.StabilityAnalysisKeyword, validmind.model_validation.embeddings.StabilityAnalysisTranslation
embeddings_metricsEmbeddingsMetricsTest suite for embeddings metricsvalidmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.embeddings.DescriptiveAnalytics, validmind.model_validation.embeddings.CosineSimilarityDistribution, validmind.model_validation.embeddings.ClusterDistribution, validmind.model_validation.embeddings.EmbeddingsVisualization2D
embeddings_model_performanceEmbeddingsPerformanceTest suite for embeddings model performancevalidmind.model_validation.embeddings.StabilityAnalysisRandomNoise, validmind.model_validation.embeddings.StabilityAnalysisSynonyms, validmind.model_validation.embeddings.StabilityAnalysisKeyword, validmind.model_validation.embeddings.StabilityAnalysisTranslation
hyper_parameters_optimizationKmeansParametersOptimizationTest suite for sklearn hyperparameters optimizationvalidmind.model_validation.sklearn.HyperParametersTuning, validmind.model_validation.sklearn.KMeansClustersOptimization
llm_classifier_full_suiteLLMClassifierFullSuiteFull test suite for LLM classification models.validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.nlp.StopWords, validmind.data_validation.nlp.Punctuations, validmind.data_validation.nlp.CommonWords, validmind.data_validation.nlp.TextDescription, validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance, validmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison, validmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis, validmind.prompt_validation.Bias, validmind.prompt_validation.Clarity, validmind.prompt_validation.Conciseness, validmind.prompt_validation.Delimitation, validmind.prompt_validation.NegativeInstruction, validmind.prompt_validation.Robustness, validmind.prompt_validation.Specificity
prompt_validationPromptValidationTest suite for prompt validationvalidmind.prompt_validation.Bias, validmind.prompt_validation.Clarity, validmind.prompt_validation.Conciseness, validmind.prompt_validation.Delimitation, validmind.prompt_validation.NegativeInstruction, validmind.prompt_validation.Robustness, validmind.prompt_validation.Specificity
nlp_classifier_full_suiteNLPClassifierFullSuiteFull test suite for NLP classification models.validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.nlp.StopWords, validmind.data_validation.nlp.Punctuations, validmind.data_validation.nlp.CommonWords, validmind.data_validation.nlp.TextDescription, validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance, validmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison, validmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis
regression_metricsRegressionMetricsTest suite for performance metrics of regression metricsvalidmind.data_validation.DatasetSplit, validmind.model_validation.ModelMetadata, validmind.model_validation.sklearn.PermutationFeatureImportance
regression_model_descriptionRegressionModelDescriptionTest suite for performance metric of regression model of statsmodels libraryvalidmind.data_validation.DatasetSplit, validmind.model_validation.ModelMetadata
regression_models_evaluationRegressionModelsEvaluationTest suite for metrics comparison of regression model of statsmodels libraryvalidmind.model_validation.statsmodels.RegressionModelCoeffs, validmind.model_validation.sklearn.RegressionModelsPerformanceComparison
regression_full_suiteRegressionFullSuiteFull test suite for regression models.validmind.data_validation.DatasetDescription, validmind.data_validation.DescriptiveStatistics, validmind.data_validation.PearsonCorrelationMatrix, validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.HighCardinality, validmind.data_validation.HighPearsonCorrelation, validmind.data_validation.MissingValues, validmind.data_validation.Skewness, validmind.data_validation.UniqueRows, validmind.data_validation.TooManyZeroValues, validmind.data_validation.DatasetSplit, validmind.model_validation.ModelMetadata, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.RegressionErrors, validmind.model_validation.sklearn.RegressionR2Square
regression_performanceRegressionPerformanceTest suite for regression model performancevalidmind.model_validation.sklearn.RegressionErrors, validmind.model_validation.sklearn.RegressionR2Square
summarization_metricsSummarizationMetricsTest suite for Summarization metricsvalidmind.model_validation.TokenDisparity, validmind.model_validation.BleuScore, validmind.model_validation.BertScore, validmind.model_validation.ContextualRecall
tabular_datasetTabularDatasetTest suite for tabular datasets.validmind.data_validation.DatasetDescription, validmind.data_validation.DescriptiveStatistics, validmind.data_validation.PearsonCorrelationMatrix, validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.HighCardinality, validmind.data_validation.HighPearsonCorrelation, validmind.data_validation.MissingValues, validmind.data_validation.Skewness, validmind.data_validation.UniqueRows, validmind.data_validation.TooManyZeroValues
tabular_dataset_descriptionTabularDatasetDescriptionTest suite to extract metadata and descriptive\n", - "statistics from a tabular datasetvalidmind.data_validation.DatasetDescription, validmind.data_validation.DescriptiveStatistics, validmind.data_validation.PearsonCorrelationMatrix
tabular_data_qualityTabularDataQualityTest suite for data quality on tabular datasetsvalidmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.HighCardinality, validmind.data_validation.HighPearsonCorrelation, validmind.data_validation.MissingValues, validmind.data_validation.Skewness, validmind.data_validation.UniqueRows, validmind.data_validation.TooManyZeroValues
text_data_qualityTextDataQualityTest suite for data quality on text datavalidmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.nlp.StopWords, validmind.data_validation.nlp.Punctuations, validmind.data_validation.nlp.CommonWords, validmind.data_validation.nlp.TextDescription
time_series_data_qualityTimeSeriesDataQualityTest suite for data quality on time series datasetsvalidmind.data_validation.TimeSeriesOutliers, validmind.data_validation.TimeSeriesMissingValues, validmind.data_validation.TimeSeriesFrequency
time_series_datasetTimeSeriesDatasetTest suite for time series datasets.validmind.data_validation.TimeSeriesOutliers, validmind.data_validation.TimeSeriesMissingValues, validmind.data_validation.TimeSeriesFrequency, validmind.data_validation.TimeSeriesLinePlot, validmind.data_validation.TimeSeriesHistogram, validmind.data_validation.ACFandPACFPlot, validmind.data_validation.SeasonalDecompose, validmind.data_validation.AutoSeasonality, validmind.data_validation.AutoStationarity, validmind.data_validation.RollingStatsPlot, validmind.data_validation.AutoAR, validmind.data_validation.AutoMA, validmind.data_validation.ScatterPlot, validmind.data_validation.LaggedCorrelationHeatmap, validmind.data_validation.EngleGrangerCoint, validmind.data_validation.SpreadPlot
time_series_model_validationTimeSeriesModelValidationTest suite for time series model validation.validmind.data_validation.DatasetSplit, validmind.model_validation.ModelMetadata, validmind.model_validation.statsmodels.RegressionModelCoeffs, validmind.model_validation.sklearn.RegressionModelsPerformanceComparison
time_series_multivariateTimeSeriesMultivariateThis test suite provides a preliminary understanding of the features\n", - "and relationship in multivariate dataset. It presents various\n", - "multivariate visualizations that can help identify patterns, trends,\n", - "and relationships between pairs of variables. The visualizations are\n", - "designed to explore the relationships between multiple features\n", - "simultaneously. They allow you to quickly identify any patterns or\n", - "trends in the data, as well as any potential outliers or anomalies.\n", - "The individual feature distribution can also be explored to provide\n", - "insight into the range and frequency of values observed in the data.\n", - "This multivariate analysis test suite aims to provide an overview of\n", - "the data structure and guide further exploration and modeling.validmind.data_validation.ScatterPlot, validmind.data_validation.LaggedCorrelationHeatmap, validmind.data_validation.EngleGrangerCoint, validmind.data_validation.SpreadPlot
time_series_univariateTimeSeriesUnivariateThis test suite provides a preliminary understanding of the target variable(s)\n", - "used in the time series dataset. It visualizations that present the raw time\n", - "series data and a histogram of the target variable(s).\n", - "\n", - "The raw time series data provides a visual inspection of the target variable's\n", - "behavior over time. This helps to identify any patterns or trends in the data,\n", - "as well as any potential outliers or anomalies. The histogram of the target\n", - "variable displays the distribution of values, providing insight into the range\n", - "and frequency of values observed in the data.validmind.data_validation.TimeSeriesLinePlot, validmind.data_validation.TimeSeriesHistogram, validmind.data_validation.ACFandPACFPlot, validmind.data_validation.SeasonalDecompose, validmind.data_validation.AutoSeasonality, validmind.data_validation.AutoStationarity, validmind.data_validation.RollingStatsPlot, validmind.data_validation.AutoAR, validmind.data_validation.AutoMA
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IDNameDescriptionTests
classifier_model_diagnosisClassifierDiagnosisTest suite for sklearn classifier model diagnosis testsvalidmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis
classifier_full_suiteClassifierFullSuiteFull test suite for binary classification models.validmind.data_validation.DatasetDescription, validmind.data_validation.DescriptiveStatistics, validmind.data_validation.PearsonCorrelationMatrix, validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.HighCardinality, validmind.data_validation.HighPearsonCorrelation, validmind.data_validation.MissingValues, validmind.data_validation.Skewness, validmind.data_validation.UniqueRows, validmind.data_validation.TooManyZeroValues, validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance, validmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison, validmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis
classifier_metricsClassifierMetricsTest suite for sklearn classifier metricsvalidmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance
classifier_model_validationClassifierModelValidationTest suite for binary classification models.validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance, validmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison, validmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis
classifier_validationClassifierPerformanceTest suite for sklearn classifier modelsvalidmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison
cluster_full_suiteClusterFullSuiteFull test suite for clustering models.validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.HomogeneityScore, validmind.model_validation.sklearn.CompletenessScore, validmind.model_validation.sklearn.VMeasure, validmind.model_validation.sklearn.AdjustedRandIndex, validmind.model_validation.sklearn.AdjustedMutualInformation, validmind.model_validation.sklearn.FowlkesMallowsScore, validmind.model_validation.sklearn.ClusterPerformanceMetrics, validmind.model_validation.sklearn.ClusterCosineSimilarity, validmind.model_validation.sklearn.SilhouettePlot, validmind.model_validation.ClusterSizeDistribution, validmind.model_validation.sklearn.HyperParametersTuning, validmind.model_validation.sklearn.KMeansClustersOptimization
cluster_metricsClusterMetricsTest suite for sklearn clustering metricsvalidmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.HomogeneityScore, validmind.model_validation.sklearn.CompletenessScore, validmind.model_validation.sklearn.VMeasure, validmind.model_validation.sklearn.AdjustedRandIndex, validmind.model_validation.sklearn.AdjustedMutualInformation, validmind.model_validation.sklearn.FowlkesMallowsScore, validmind.model_validation.sklearn.ClusterPerformanceMetrics, validmind.model_validation.sklearn.ClusterCosineSimilarity, validmind.model_validation.sklearn.SilhouettePlot
cluster_performanceClusterPerformanceTest suite for sklearn cluster performancevalidmind.model_validation.ClusterSizeDistribution
embeddings_full_suiteEmbeddingsFullSuiteFull test suite for embeddings models.validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.embeddings.DescriptiveAnalytics, validmind.model_validation.embeddings.CosineSimilarityDistribution, validmind.model_validation.embeddings.ClusterDistribution, validmind.model_validation.embeddings.EmbeddingsVisualization2D, validmind.model_validation.embeddings.StabilityAnalysisRandomNoise, validmind.model_validation.embeddings.StabilityAnalysisSynonyms, validmind.model_validation.embeddings.StabilityAnalysisKeyword, validmind.model_validation.embeddings.StabilityAnalysisTranslation
embeddings_metricsEmbeddingsMetricsTest suite for embeddings metricsvalidmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.embeddings.DescriptiveAnalytics, validmind.model_validation.embeddings.CosineSimilarityDistribution, validmind.model_validation.embeddings.ClusterDistribution, validmind.model_validation.embeddings.EmbeddingsVisualization2D
embeddings_model_performanceEmbeddingsPerformanceTest suite for embeddings model performancevalidmind.model_validation.embeddings.StabilityAnalysisRandomNoise, validmind.model_validation.embeddings.StabilityAnalysisSynonyms, validmind.model_validation.embeddings.StabilityAnalysisKeyword, validmind.model_validation.embeddings.StabilityAnalysisTranslation
hyper_parameters_optimizationKmeansParametersOptimizationTest suite for sklearn hyperparameters optimizationvalidmind.model_validation.sklearn.HyperParametersTuning, validmind.model_validation.sklearn.KMeansClustersOptimization
llm_classifier_full_suite | LLMClassifierFullSuite | Full test suite for LLM classification models. | validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.nlp.StopWords, validmind.data_validation.nlp.Punctuations, validmind.data_validation.nlp.CommonWords, validmind.data_validation.nlp.TextDescription, validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance, validmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison, validmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis, validmind.prompt_validation.Bias, validmind.prompt_validation.Clarity, validmind.prompt_validation.Conciseness, validmind.prompt_validation.Delimitation, validmind.prompt_validation.NegativeInstruction, validmind.prompt_validation.Robustness, validmind.prompt_validation.Specificity
prompt_validation | PromptValidation | Test suite for prompt validation. | validmind.prompt_validation.Bias, validmind.prompt_validation.Clarity, validmind.prompt_validation.Conciseness, validmind.prompt_validation.Delimitation, validmind.prompt_validation.NegativeInstruction, validmind.prompt_validation.Robustness, validmind.prompt_validation.Specificity
nlp_classifier_full_suite | NLPClassifierFullSuite | Full test suite for NLP classification models. | validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.nlp.StopWords, validmind.data_validation.nlp.Punctuations, validmind.data_validation.nlp.CommonWords, validmind.data_validation.nlp.TextDescription, validmind.model_validation.ModelMetadata, validmind.data_validation.DatasetSplit, validmind.model_validation.sklearn.ConfusionMatrix, validmind.model_validation.sklearn.ClassifierPerformance, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.PrecisionRecallCurve, validmind.model_validation.sklearn.ROCCurve, validmind.model_validation.sklearn.PopulationStabilityIndex, validmind.model_validation.sklearn.SHAPGlobalImportance, validmind.model_validation.sklearn.MinimumAccuracy, validmind.model_validation.sklearn.MinimumF1Score, validmind.model_validation.sklearn.MinimumROCAUCScore, validmind.model_validation.sklearn.TrainingTestDegradation, validmind.model_validation.sklearn.ModelsPerformanceComparison, validmind.model_validation.sklearn.OverfitDiagnosis, validmind.model_validation.sklearn.WeakspotsDiagnosis, validmind.model_validation.sklearn.RobustnessDiagnosis
regression_metrics | RegressionMetrics | Test suite for performance metrics of regression models. | validmind.data_validation.DatasetSplit, validmind.model_validation.ModelMetadata, validmind.model_validation.sklearn.PermutationFeatureImportance
regression_model_description | RegressionModelDescription | Test suite for performance metrics of a statsmodels regression model. | validmind.data_validation.DatasetSplit, validmind.model_validation.ModelMetadata
regression_models_evaluation | RegressionModelsEvaluation | Test suite for comparing metrics across statsmodels regression models. | validmind.model_validation.statsmodels.RegressionModelCoeffs, validmind.model_validation.sklearn.RegressionModelsPerformanceComparison
regression_full_suite | RegressionFullSuite | Full test suite for regression models. | validmind.data_validation.DatasetDescription, validmind.data_validation.DescriptiveStatistics, validmind.data_validation.PearsonCorrelationMatrix, validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.HighCardinality, validmind.data_validation.HighPearsonCorrelation, validmind.data_validation.MissingValues, validmind.data_validation.Skewness, validmind.data_validation.UniqueRows, validmind.data_validation.TooManyZeroValues, validmind.data_validation.DatasetSplit, validmind.model_validation.ModelMetadata, validmind.model_validation.sklearn.PermutationFeatureImportance, validmind.model_validation.sklearn.RegressionErrors, validmind.model_validation.sklearn.RegressionR2Square
regression_performance | RegressionPerformance | Test suite for regression model performance. | validmind.model_validation.sklearn.RegressionErrors, validmind.model_validation.sklearn.RegressionR2Square
summarization_metrics | SummarizationMetrics | Test suite for summarization metrics. | validmind.model_validation.TokenDisparity, validmind.model_validation.BleuScore, validmind.model_validation.BertScore, validmind.model_validation.ContextualRecall
tabular_dataset | TabularDataset | Test suite for tabular datasets. | validmind.data_validation.DatasetDescription, validmind.data_validation.DescriptiveStatistics, validmind.data_validation.PearsonCorrelationMatrix, validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.HighCardinality, validmind.data_validation.HighPearsonCorrelation, validmind.data_validation.MissingValues, validmind.data_validation.Skewness, validmind.data_validation.UniqueRows, validmind.data_validation.TooManyZeroValues
tabular_dataset_description | TabularDatasetDescription | Test suite to extract metadata and descriptive statistics from a tabular dataset. | validmind.data_validation.DatasetDescription, validmind.data_validation.DescriptiveStatistics, validmind.data_validation.PearsonCorrelationMatrix
tabular_data_quality | TabularDataQuality | Test suite for data quality on tabular datasets. | validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.HighCardinality, validmind.data_validation.HighPearsonCorrelation, validmind.data_validation.MissingValues, validmind.data_validation.Skewness, validmind.data_validation.UniqueRows, validmind.data_validation.TooManyZeroValues
text_data_quality | TextDataQuality | Test suite for data quality on text data. | validmind.data_validation.ClassImbalance, validmind.data_validation.Duplicates, validmind.data_validation.nlp.StopWords, validmind.data_validation.nlp.Punctuations, validmind.data_validation.nlp.CommonWords, validmind.data_validation.nlp.TextDescription
time_series_data_quality | TimeSeriesDataQuality | Test suite for data quality on time series datasets. | validmind.data_validation.TimeSeriesOutliers, validmind.data_validation.TimeSeriesMissingValues, validmind.data_validation.TimeSeriesFrequency
time_series_dataset | TimeSeriesDataset | Test suite for time series datasets. | validmind.data_validation.TimeSeriesOutliers, validmind.data_validation.TimeSeriesMissingValues, validmind.data_validation.TimeSeriesFrequency, validmind.data_validation.TimeSeriesLinePlot, validmind.data_validation.TimeSeriesHistogram, validmind.data_validation.ACFandPACFPlot, validmind.data_validation.SeasonalDecompose, validmind.data_validation.AutoSeasonality, validmind.data_validation.AutoStationarity, validmind.data_validation.RollingStatsPlot, validmind.data_validation.AutoAR, validmind.data_validation.AutoMA, validmind.data_validation.ScatterPlot, validmind.data_validation.LaggedCorrelationHeatmap, validmind.data_validation.EngleGrangerCoint, validmind.data_validation.SpreadPlot
time_series_model_validation | TimeSeriesModelValidation | Test suite for time series model validation. | validmind.data_validation.DatasetSplit, validmind.model_validation.ModelMetadata, validmind.model_validation.statsmodels.RegressionModelCoeffs, validmind.model_validation.sklearn.RegressionModelsPerformanceComparison
time_series_multivariate | TimeSeriesMultivariate | This test suite provides a preliminary understanding of the features and relationships in a multivariate dataset. It presents various multivariate visualizations that can help identify patterns, trends, and relationships between pairs of variables. The visualizations are designed to explore the relationships between multiple features simultaneously, allowing you to quickly identify any patterns or trends in the data, as well as any potential outliers or anomalies. The individual feature distributions can also be explored to provide insight into the range and frequency of values observed in the data. This multivariate analysis test suite aims to provide an overview of the data structure and guide further exploration and modeling. | validmind.data_validation.ScatterPlot, validmind.data_validation.LaggedCorrelationHeatmap, validmind.data_validation.EngleGrangerCoint, validmind.data_validation.SpreadPlot
time_series_univariate | TimeSeriesUnivariate | This test suite provides a preliminary understanding of the target variable(s) used in the time series dataset. It includes visualizations that present the raw time series data and a histogram of the target variable(s). The raw time series data provides a visual inspection of the target variable's behavior over time, helping to identify any patterns or trends in the data, as well as any potential outliers or anomalies. The histogram of the target variable displays the distribution of values, providing insight into the range and frequency of values observed in the data. | validmind.data_validation.TimeSeriesLinePlot, validmind.data_validation.TimeSeriesHistogram, validmind.data_validation.ACFandPACFPlot, validmind.data_validation.SeasonalDecompose, validmind.data_validation.AutoSeasonality, validmind.data_validation.AutoStationarity, validmind.data_validation.RollingStatsPlot, validmind.data_validation.AutoAR, validmind.data_validation.AutoMA
\n" ], - "source": [ - "import validmind as vm\n", - "\n", - "vm.test_suites.list_suites()" + "text/plain": [ + "" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## View test suite details\n", - "\n", - "Use the [test_suites.describe_suite()](https://docs.validmind.ai/validmind/validmind/test_suites.html#describe_suite) function to retrieve information about a test suite, including its name, description, and the list of tests it contains. \n", - "\n", - "You can call `test_suites.describe_suite()` with just the test suite ID to get basic details, or pass an additional `verbose` parameter for a more comprehensive output: \n", - "\n", - "- **Test ID** - The identifier of the test suite you want to inspect.\n", - "- **Verbose** - A Boolean flag. Set `verbose=True` to return a full breakdown of the test suite." - ] - }, + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import validmind as vm\n", + "\n", + "vm.test_suites.list_suites()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## View test suite details\n", + "\n", + "Use the [test_suites.describe_suite()](https://docs.validmind.ai/validmind/validmind/test_suites.html#describe_suite) function to retrieve information about a test suite, including its name, description, and the list of tests it contains. \n", + "\n", + "You can call `test_suites.describe_suite()` with just the test suite ID to get basic details, or pass an additional `verbose` parameter for a more comprehensive output: \n", + "\n", + "- **Test ID** - The identifier of the test suite you want to inspect.\n", + "- **Verbose** - A Boolean flag. Set `verbose=True` to return a full breakdown of the test suite." 
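Before running the verbose breakdown below, it can help to see the two call styles side by side. A minimal sketch, using the same `classifier_full_suite` ID the next cell inspects:

```python
import validmind as vm

# Basic details: suite name, description, and the tests it contains
vm.test_suites.describe_suite("classifier_full_suite")

# Full breakdown: every section of the suite and the tests mapped to it
vm.test_suites.describe_suite("classifier_full_suite", verbose=True)
```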
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| Test Suite ID | Test Suite Name | Test Suite Section | Test ID | Test Name |
|---|---|---|---|---|
| classifier_full_suite | ClassifierFullSuite | tabular_dataset_description | validmind.data_validation.DatasetDescription | Dataset Description |
| classifier_full_suite | ClassifierFullSuite | tabular_dataset_description | validmind.data_validation.DescriptiveStatistics | Descriptive Statistics |
| classifier_full_suite | ClassifierFullSuite | tabular_dataset_description | validmind.data_validation.PearsonCorrelationMatrix | Pearson Correlation Matrix |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.ClassImbalance | Class Imbalance |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.Duplicates | Duplicates |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.HighCardinality | High Cardinality |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.HighPearsonCorrelation | High Pearson Correlation |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.MissingValues | Missing Values |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.Skewness | Skewness |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.UniqueRows | Unique Rows |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.TooManyZeroValues | Too Many Zero Values |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.ModelMetadata | Model Metadata |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.data_validation.DatasetSplit | Dataset Split |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.ConfusionMatrix | Confusion Matrix |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.ClassifierPerformance | Classifier Performance |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.PermutationFeatureImportance | Permutation Feature Importance |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.PrecisionRecallCurve | Precision Recall Curve |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.ROCCurve | ROC Curve |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.PopulationStabilityIndex | Population Stability Index |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.SHAPGlobalImportance | SHAP Global Importance |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.MinimumAccuracy | Minimum Accuracy |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.MinimumF1Score | Minimum F1 Score |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.MinimumROCAUCScore | Minimum ROCAUC Score |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.TrainingTestDegradation | Training Test Degradation |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.ModelsPerformanceComparison | Models Performance Comparison |
| classifier_full_suite | ClassifierFullSuite | classifier_model_diagnosis | validmind.model_validation.sklearn.OverfitDiagnosis | Overfit Diagnosis |
| classifier_full_suite | ClassifierFullSuite | classifier_model_diagnosis | validmind.model_validation.sklearn.WeakspotsDiagnosis | Weakspots Diagnosis |
| classifier_full_suite | ClassifierFullSuite | classifier_model_diagnosis | validmind.model_validation.sklearn.RobustnessDiagnosis | Robustness Diagnosis |
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| Test Suite ID | Test Suite Name | Test Suite Section | Test ID | Test Name |
|---|---|---|---|---|
| classifier_full_suite | ClassifierFullSuite | tabular_dataset_description | validmind.data_validation.DatasetDescription | Dataset Description |
| classifier_full_suite | ClassifierFullSuite | tabular_dataset_description | validmind.data_validation.DescriptiveStatistics | Descriptive Statistics |
| classifier_full_suite | ClassifierFullSuite | tabular_dataset_description | validmind.data_validation.PearsonCorrelationMatrix | Pearson Correlation Matrix |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.ClassImbalance | Class Imbalance |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.Duplicates | Duplicates |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.HighCardinality | High Cardinality |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.HighPearsonCorrelation | High Pearson Correlation |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.MissingValues | Missing Values |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.Skewness | Skewness |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.UniqueRows | Unique Rows |
| classifier_full_suite | ClassifierFullSuite | tabular_data_quality | validmind.data_validation.TooManyZeroValues | Too Many Zero Values |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.ModelMetadata | Model Metadata |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.data_validation.DatasetSplit | Dataset Split |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.ConfusionMatrix | Confusion Matrix |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.ClassifierPerformance | Classifier Performance |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.PermutationFeatureImportance | Permutation Feature Importance |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.PrecisionRecallCurve | Precision Recall Curve |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.ROCCurve | ROC Curve |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.PopulationStabilityIndex | Population Stability Index |
| classifier_full_suite | ClassifierFullSuite | classifier_metrics | validmind.model_validation.sklearn.SHAPGlobalImportance | SHAP Global Importance |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.MinimumAccuracy | Minimum Accuracy |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.MinimumF1Score | Minimum F1 Score |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.MinimumROCAUCScore | Minimum ROCAUC Score |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.TrainingTestDegradation | Training Test Degradation |
| classifier_full_suite | ClassifierFullSuite | classifier_validation | validmind.model_validation.sklearn.ModelsPerformanceComparison | Models Performance Comparison |
| classifier_full_suite | ClassifierFullSuite | classifier_model_diagnosis | validmind.model_validation.sklearn.OverfitDiagnosis | Overfit Diagnosis |
| classifier_full_suite | ClassifierFullSuite | classifier_model_diagnosis | validmind.model_validation.sklearn.WeakspotsDiagnosis | Weakspots Diagnosis |
| classifier_full_suite | ClassifierFullSuite | classifier_model_diagnosis | validmind.model_validation.sklearn.RobustnessDiagnosis | Robustness Diagnosis |
\n" ], - "source": [ - "vm.test_suites.describe_suite(\"classifier_full_suite\", verbose=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "### View test details\n", - "\n", - "To inspect a specific test in a suite, pass the name of the test to [tests.describe_test()](https://docs.validmind.ai/validmind/validmind/tests.html#describe_test) to get detailed information about the test such as its purpose, strengths and limitations:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.describe_test(\"validmind.data_validation.DescriptiveStatistics\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\"Screenshot\n", - "\"Screenshot" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Next steps\n", - "\n", - "Now that you’ve learned how to identify ValidMind test suites relevant to your use cases, we encourage you to explore our interactive notebooks to discover additional tests, learn how to run them, and effectively document your models.\n", - "\n", - "
Learn more about the individual tests available in the ValidMind Library\n", - "

\n", - "Check out our Explore tests notebook for more code examples and usage of key functions.
\n", - "\n", - "\n", - "\n", - "### Discover more learning resources\n", - "\n", - "We offer many interactive notebooks to help you automate testing, documenting, validating, and more:\n", - "\n", - "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", - "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", - "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", - "\n", - "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." + "text/plain": [ + "" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Upgrade ValidMind\n", - "\n", - "
After installing ValidMind, periodically confirm that you're on the latest version to access new features and other enhancements.
\n", - "\n", - "Retrieve the information for the currently installed version of ValidMind:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip show validmind" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", - "\n", - "```bash\n", - "%pip install --upgrade validmind\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You may need to restart your kernel after running the upgrade package for changes to be applied." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "\n", - "\n", - "***\n", - "\n", - "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", - "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", - "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "ValidMind Library", - "language": "python", - "name": "validmind" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.13" + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" } + ], + "source": [ + "vm.test_suites.describe_suite(\"classifier_full_suite\", verbose=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "### View test details\n", + "\n", + "To inspect a specific test in a suite, pass the name of the test to [tests.describe_test()](https://docs.validmind.ai/validmind/validmind/tests.html#describe_test) to get detailed information about the test such as its purpose, strengths and limitations:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "vm.tests.describe_test(\"validmind.data_validation.DescriptiveStatistics\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Screenshot\n", + "\"Screenshot" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Next steps\n", + "\n", + "Now that you’ve learned how to identify ValidMind test suites relevant to your use cases, we encourage you to explore our interactive notebooks to discover additional tests, learn how to run them, and effectively document your models.\n", + "\n", + "
Learn more about the individual tests available in the ValidMind Library\n", + "

\n", + "Check out our Explore tests notebook for more code examples and usage of key functions.
\n", + "\n", + "\n", + "\n", + "### Discover more learning resources\n", + "\n", + "We offer many interactive notebooks to help you automate testing, documenting, validating, and more:\n", + "\n", + "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", + "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", + "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", + "\n", + "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Upgrade ValidMind\n", + "\n", + "
After installing ValidMind, periodically confirm that you're on the latest version to access new features and other enhancements.
\n", + "\n", + "Retrieve the information for the currently installed version of ValidMind:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip show validmind" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", + "\n", + "```bash\n", + "%pip install --upgrade validmind\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You may need to restart your kernel after running the upgrade package for changes to be applied." + ] + }, + { + "cell_type": "markdown", + "id": "copyright-a3ad64253d204629b8f2e773414c6aeb", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "***\n", + "\n", + "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", + "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", + "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "ValidMind Library", + "language": "python", + "name": "validmind" }, - "nbformat": 4, - "nbformat_minor": 2 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.13" + } + }, + "nbformat": 4, + "nbformat_minor": 2 } diff --git a/notebooks/how_to/tests/explore_tests/explore_tests.ipynb b/notebooks/how_to/tests/explore_tests/explore_tests.ipynb index ebc3323e5..629cd0932 100644 --- a/notebooks/how_to/tests/explore_tests/explore_tests.ipynb +++ b/notebooks/how_to/tests/explore_tests/explore_tests.ipynb @@ -1,4462 +1,4463 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Explore tests\n", - "\n", - "Explore the individual out-the-box tests available in the ValidMind Library, and identify which tests to run to evaluate different aspects of your model. Browse available tests, view their descriptions, and filter by tags or task type to find tests relevant to your use case." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.content-hidden when-format=\"html\"}\n", - "## Contents \n", - "- [About ValidMind](#toc1__) \n", - " - [Before you begin](#toc1_1__) \n", - " - [New to ValidMind?](#toc1_2__) \n", - " - [Key concepts](#toc1_3__) \n", - "- [Install the ValidMind Library](#toc2__) \n", - "- [List all available tests](#toc3__) \n", - "- [Understand tags and task types](#toc4__) \n", - "- [Filter tests by tags and task types](#toc5__) \n", - "- [Store test sets for use](#toc6__) \n", - "- [Next steps](#toc7__) \n", - " - [Discover more learning resources](#toc7_1__) \n", - "- [Upgrade ValidMind](#toc8__) \n", - "\n", - ":::\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## About ValidMind\n", - "\n", - "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.\n", - "\n", - "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.\n", - "\n", - "\n", - "\n", - "### Before you begin\n", - "\n", - "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. \n", - "\n", - "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).\n", - "\n", - "\n", - "\n", - "### New to ValidMind?\n", - "\n", - "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", - "\n", - "
To access all features available in this notebook, you'll need a ValidMind account.

\n", - "Register with ValidMind
\n", - "\n", - "\n", - "\n", - "### Key concepts\n", - "\n", - "**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.\n", - "\n", - "**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.\n", - "\n", - "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", - "\n", - "**Custom tests**: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.\n", - "\n", - "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", - "\n", - " - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", - " - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", - " - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.\n", - " - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this [example](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html) for more information.\n", - "\n", - "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.\n", - "\n", - "**Outputs**: Custom tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.\n", - "\n", - "**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.\n", - "\n", - "Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Install the ValidMind Library\n", - "\n", - "
Recommended Python versions\n", - "

\n", - "Python 3.8 <= x <= 3.11
\n", - "\n", - "To install the library:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q validmind" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## List all available tests" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Start by importing the functions from the [validmind.tests](https://docs.validmind.ai/validmind/validmind/tests.html) module for listing tests, listing tasks, listing tags, and listing tasks and tags to access these functions in the rest of this notebook:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.tests import (\n", - " list_tests,\n", - " list_tasks,\n", - " list_tags,\n", - " list_tasks_and_tags,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use [list_tests()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tests) to retrieve all available ValidMind tests, which returns a DataFrame with the following columns:\n", - "\n", - "- **ID** – A unique identifier for each test.\n", - "- **Name** – The test’s name.\n", - "- **Description** – A short summary of what the test evaluates.\n", - "- **Tags** – Keywords that describe what the test does or applies to.\n", - "- **Tasks** – The type of modeling task the test supports." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
| ID | Name | Description | Has Figure | Has Table | Required Inputs | Params | Tags | Tasks |
|---|---|---|---|---|---|---|---|---|
| validmind.data_validation.ACFandPACFPlot | AC Fand PACF Plot | Analyzes time series data using Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to... | True | False | ['dataset'] | {} | ['time_series_data', 'forecasting', 'statistical_test', 'visualization'] | ['regression'] |
| validmind.data_validation.ADF | ADF | Assesses the stationarity of a time series dataset using the Augmented Dickey-Fuller (ADF) test.... | False | True | ['dataset'] | {} | ['time_series_data', 'statsmodels', 'forecasting', 'statistical_test', 'stationarity'] | ['regression'] |
| validmind.data_validation.AutoAR | Auto AR | Automatically identifies the optimal Autoregressive (AR) order for a time series using BIC and AIC criteria.... | False | True | ['dataset'] | {'max_ar_order': {'type': 'int', 'default': 3}} | ['time_series_data', 'statsmodels', 'forecasting', 'statistical_test'] | ['regression'] |
| validmind.data_validation.AutoMA | Auto MA | Automatically selects the optimal Moving Average (MA) order for each variable in a time series dataset based on... | False | True | ['dataset'] | {'max_ma_order': {'type': 'int', 'default': 3}} | ['time_series_data', 'statsmodels', 'forecasting', 'statistical_test'] | ['regression'] |
| validmind.data_validation.AutoStationarity | Auto Stationarity | Automates Augmented Dickey-Fuller test to assess stationarity across multiple time series in a DataFrame.... | False | True | ['dataset'] | {'max_order': {'type': 'int', 'default': 5}, 'threshold': {'type': 'float', 'default': 0.05}} | ['time_series_data', 'statsmodels', 'forecasting', 'statistical_test'] | ['regression'] |
| validmind.data_validation.BivariateScatterPlots | Bivariate Scatter Plots | Generates bivariate scatterplots to visually inspect relationships between pairs of numerical predictor variables... | True | False | ['dataset'] | {} | ['tabular_data', 'numerical_data', 'visualization'] | ['classification'] |
| validmind.data_validation.BoxPierce | Box Pierce | Detects autocorrelation in time-series data through the Box-Pierce test to validate model performance.... | False | True | ['dataset'] | {} | ['time_series_data', 'forecasting', 'statistical_test', 'statsmodels'] | ['regression'] |
| validmind.data_validation.ChiSquaredFeaturesTable | Chi Squared Features Table | Assesses the statistical association between categorical features and a target variable using the Chi-Squared test.... | False | True | ['dataset'] | {'p_threshold': {'type': '_empty', 'default': 0.05}} | ['tabular_data', 'categorical_data', 'statistical_test'] | ['classification'] |
| validmind.data_validation.ClassImbalance | Class Imbalance | Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model.... | True | True | ['dataset'] | {'min_percent_threshold': {'type': 'int', 'default': 10}} | ['tabular_data', 'binary_classification', 'multiclass_classification', 'data_quality'] | ['classification'] |
| validmind.data_validation.DatasetDescription | Dataset Description | Provides comprehensive analysis and statistical summaries of each column in a machine learning model's dataset.... | False | True | ['dataset'] | {} | ['tabular_data', 'time_series_data', 'text_data'] | ['classification', 'regression', 'text_classification', 'text_summarization'] |
| validmind.data_validation.DatasetSplit | Dataset Split | Evaluates and visualizes the distribution proportions among training, testing, and validation datasets of an ML... | False | True | ['datasets'] | {} | ['tabular_data', 'time_series_data', 'text_data'] | ['classification', 'regression', 'text_classification', 'text_summarization'] |
| validmind.data_validation.DescriptiveStatistics | Descriptive Statistics | Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's... | False | True | ['dataset'] | {} | ['tabular_data', 'time_series_data', 'data_quality'] | ['classification', 'regression'] |
| validmind.data_validation.DickeyFullerGLS | Dickey Fuller GLS | Assesses stationarity in time series data using the Dickey-Fuller GLS test to determine the order of integration.... | False | True | ['dataset'] | {} | ['time_series_data', 'forecasting', 'unit_root_test'] | ['regression'] |
| validmind.data_validation.Duplicates | Duplicates | Tests dataset for duplicate entries, ensuring model reliability via data quality verification.... | False | True | ['dataset'] | {'min_threshold': {'type': '_empty', 'default': 1}} | ['tabular_data', 'data_quality', 'text_data'] | ['classification', 'regression'] |
| validmind.data_validation.EngleGrangerCoint | Engle Granger Coint | Assesses the degree of co-movement between pairs of time series data using the Engle-Granger cointegration test.... | False | True | ['dataset'] | {'threshold': {'type': 'float', 'default': 0.05}} | ['time_series_data', 'statistical_test', 'forecasting'] | ['regression'] |
| validmind.data_validation.FeatureTargetCorrelationPlot | Feature Target Correlation Plot | Visualizes the correlation between input features and the model's target output in a color-coded horizontal bar... | True | False | ['dataset'] | {'fig_height': {'type': '_empty', 'default': 600}} | ['tabular_data', 'visualization', 'correlation'] | ['classification', 'regression'] |
| validmind.data_validation.HighCardinality | High Cardinality | Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting.... | False | True | ['dataset'] | {'num_threshold': {'type': 'int', 'default': 100}, 'percent_threshold': {'type': 'float', 'default': 0.1}, 'threshold_type': {'type': 'str', 'default': 'percent'}} | ['tabular_data', 'data_quality', 'categorical_data'] | ['classification', 'regression'] |
| validmind.data_validation.HighPearsonCorrelation | High Pearson Correlation | Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity.... | False | True | ['dataset'] | {'max_threshold': {'type': 'float', 'default': 0.3}, 'top_n_correlations': {'type': 'int', 'default': 10}, 'feature_columns': {'type': 'list', 'default': None}} | ['tabular_data', 'data_quality', 'correlation'] | ['classification', 'regression'] |
| validmind.data_validation.IQROutliersBarPlot | IQR Outliers Bar Plot | Visualizes outlier distribution across percentiles in numerical data using the Interquartile Range (IQR) method.... | True | False | ['dataset'] | {'threshold': {'type': 'float', 'default': 1.5}, 'fig_width': {'type': 'int', 'default': 800}} | ['tabular_data', 'visualization', 'numerical_data'] | ['classification', 'regression'] |
| validmind.data_validation.IQROutliersTable | IQR Outliers Table | Determines and summarizes outliers in numerical features using the Interquartile Range method.... | False | True | ['dataset'] | {'threshold': {'type': 'float', 'default': 1.5}} | ['tabular_data', 'numerical_data'] | ['classification', 'regression'] |
| validmind.data_validation.IsolationForestOutliers | Isolation Forest Outliers | Detects outliers in a dataset using the Isolation Forest algorithm and visualizes results through scatter plots.... | True | False | ['dataset'] | {'random_state': {'type': 'int', 'default': 0}, 'contamination': {'type': 'float', 'default': 0.1}, 'feature_columns': {'type': 'list', 'default': None}} | ['tabular_data', 'anomaly_detection'] | ['classification'] |
validmind.data_validation.JarqueBeraJarque BeraAssesses normality of dataset features in an ML model using the Jarque-Bera test....FalseTrue['dataset']{}['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels']['classification', 'regression']
validmind.data_validation.KPSSKPSSAssesses the stationarity of time-series data in a machine learning model using the KPSS unit root test....FalseTrue['dataset']{}['time_series_data', 'stationarity', 'unit_root_test', 'statsmodels']['data_validation']
validmind.data_validation.LJungBoxL Jung BoxAssesses autocorrelations in dataset features by performing a Ljung-Box test on each feature....FalseTrue['dataset']{}['time_series_data', 'forecasting', 'statistical_test', 'statsmodels']['regression']
validmind.data_validation.LaggedCorrelationHeatmapLagged Correlation HeatmapAssesses and visualizes correlation between target variable and lagged independent variables in a time-series...TrueFalse['dataset']{'num_lags': {'type': 'int', 'default': 10}}['time_series_data', 'visualization']['regression']
validmind.data_validation.MissingValuesMissing ValuesEvaluates dataset quality by ensuring missing value ratio across all features does not exceed a set threshold....FalseTrue['dataset']{'min_threshold': {'type': 'int', 'default': 1}}['tabular_data', 'data_quality']['classification', 'regression']
validmind.data_validation.MissingValuesBarPlotMissing Values Bar PlotAssesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on...TrueFalse['dataset']{'threshold': {'type': 'int', 'default': 80}, 'fig_height': {'type': 'int', 'default': 600}}['tabular_data', 'data_quality', 'visualization']['classification', 'regression']
validmind.data_validation.MutualInformationMutual InformationCalculates mutual information scores between features and target variable to evaluate feature relevance....TrueFalse['dataset']{'min_threshold': {'type': 'float', 'default': 0.01}, 'task': {'type': 'str', 'default': 'classification'}}['feature_selection', 'data_analysis']['classification', 'regression']
validmind.data_validation.PearsonCorrelationMatrixPearson Correlation MatrixEvaluates linear dependency between numerical variables in a dataset via a Pearson Correlation coefficient heat map....TrueFalse['dataset']{}['tabular_data', 'numerical_data', 'correlation']['classification', 'regression']
validmind.data_validation.PhillipsPerronArchPhillips Perron ArchAssesses the stationarity of time series data in each feature of the ML model using the Phillips-Perron test....FalseTrue['dataset']{}['time_series_data', 'forecasting', 'statistical_test', 'unit_root_test']['regression']
validmind.data_validation.ProtectedClassesDescriptionProtected Classes DescriptionVisualizes the distribution of protected classes in the dataset relative to the target variable...TrueTrue['dataset']{'protected_classes': {'type': '_empty', 'default': None}}['bias_and_fairness', 'descriptive_statistics']['classification', 'regression']
validmind.data_validation.RollingStatsPlotRolling Stats PlotEvaluates the stationarity of time series data by plotting its rolling mean and standard deviation over a specified...TrueFalse['dataset']{'window_size': {'type': 'int', 'default': 12}}['time_series_data', 'visualization', 'stationarity']['regression']
validmind.data_validation.RunsTestRuns TestExecutes Runs Test on ML model to detect non-random patterns in output data sequence....FalseTrue['dataset']{}['tabular_data', 'statistical_test', 'statsmodels']['classification', 'regression']
validmind.data_validation.ScatterPlotScatter PlotAssesses visual relationships, patterns, and outliers among features in a dataset through scatter plot matrices....TrueFalse['dataset']{}['tabular_data', 'visualization']['classification', 'regression']
validmind.data_validation.ScoreBandDefaultRatesScore Band Default RatesAnalyzes default rates and population distribution across credit score bands....FalseTrue['dataset', 'model']{'score_column': {'type': 'str', 'default': 'score'}, 'score_bands': {'type': 'list', 'default': None}}['visualization', 'credit_risk', 'scorecard']['classification']
validmind.data_validation.SeasonalDecomposeSeasonal DecomposeAssesses patterns and seasonality in a time series dataset by decomposing its features into foundational components....TrueFalse['dataset']{'seasonal_model': {'type': 'str', 'default': 'additive'}}['time_series_data', 'seasonality', 'statsmodels']['regression']
validmind.data_validation.ShapiroWilkShapiro WilkEvaluates feature-wise normality of training data using the Shapiro-Wilk test....FalseTrue['dataset']{}['tabular_data', 'data_distribution', 'statistical_test']['classification', 'regression']
validmind.data_validation.SkewnessSkewnessEvaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data...FalseTrue['dataset']{'max_threshold': {'type': '_empty', 'default': 1}}['data_quality', 'tabular_data']['classification', 'regression']
validmind.data_validation.SpreadPlotSpread PlotAssesses potential correlations between pairs of time series variables through visualization to enhance...TrueFalse['dataset']{}['time_series_data', 'visualization']['regression']
validmind.data_validation.TabularCategoricalBarPlotsTabular Categorical Bar PlotsGenerates and visualizes bar plots for each category in categorical features to evaluate the dataset's composition....TrueFalse['dataset']{}['tabular_data', 'visualization']['classification', 'regression']
validmind.data_validation.TabularDateTimeHistogramsTabular Date Time HistogramsGenerates histograms to provide graphical insight into the distribution of time intervals in a model's datetime...TrueFalse['dataset']{}['time_series_data', 'visualization']['classification', 'regression']
validmind.data_validation.TabularDescriptionTablesTabular Description TablesSummarizes key descriptive statistics for numerical, categorical, and datetime variables in a dataset....FalseTrue['dataset']{}['tabular_data']['classification', 'regression']
validmind.data_validation.TabularNumericalHistogramsTabular Numerical HistogramsGenerates histograms for each numerical feature in a dataset to provide visual insights into data distribution and...TrueFalse['dataset']{}['tabular_data', 'visualization']['classification', 'regression']
validmind.data_validation.TargetRateBarPlotsTarget Rate Bar PlotsGenerates bar plots visualizing the default rates of categorical features for a classification machine learning...TrueFalse['dataset']{}['tabular_data', 'visualization', 'categorical_data']['classification']
validmind.data_validation.TimeSeriesDescriptionTime Series DescriptionGenerates a detailed analysis for the provided time series dataset, summarizing key statistics to identify trends,...FalseTrue['dataset']{}['time_series_data', 'analysis']['regression']
validmind.data_validation.TimeSeriesDescriptiveStatisticsTime Series Descriptive StatisticsEvaluates the descriptive statistics of a time series dataset to identify trends, patterns, and data quality issues....FalseTrue['dataset']{}['time_series_data', 'analysis']['regression']
validmind.data_validation.TimeSeriesFrequencyTime Series FrequencyEvaluates consistency of time series data frequency and generates a frequency plot....TrueTrue['dataset']{}['time_series_data']['regression']
validmind.data_validation.TimeSeriesHistogramTime Series HistogramVisualizes distribution of time-series data using histograms and Kernel Density Estimation (KDE) lines....TrueFalse['dataset']{'nbins': {'type': '_empty', 'default': 30}}['data_validation', 'visualization', 'time_series_data']['regression', 'time_series_forecasting']
validmind.data_validation.TimeSeriesLinePlotTime Series Line PlotGenerates and analyses time-series data through line plots revealing trends, patterns, anomalies over time....TrueFalse['dataset']{}['time_series_data', 'visualization']['regression']
validmind.data_validation.TimeSeriesMissingValuesTime Series Missing ValuesValidates time-series data quality by confirming the count of missing values is below a certain threshold....TrueTrue['dataset']{'min_threshold': {'type': 'int', 'default': 1}}['time_series_data']['regression']
validmind.data_validation.TimeSeriesOutliersTime Series OutliersIdentifies and visualizes outliers in time-series data using the z-score method....FalseTrue['dataset']{'zscore_threshold': {'type': 'int', 'default': 3}}['time_series_data']['regression']
validmind.data_validation.TooManyZeroValuesToo Many Zero ValuesIdentifies numerical columns in a dataset that contain an excessive number of zero values, defined by a threshold...FalseTrue['dataset']{'max_percent_threshold': {'type': 'float', 'default': 0.03}}['tabular_data']['regression', 'classification']
validmind.data_validation.UniqueRowsUnique RowsVerifies the diversity of the dataset by ensuring that the count of unique rows exceeds a prescribed threshold....FalseTrue['dataset']{'min_percent_threshold': {'type': 'float', 'default': 1}}['tabular_data']['regression', 'classification']
validmind.data_validation.WOEBinPlotsWOE Bin PlotsGenerates visualizations of Weight of Evidence (WoE) and Information Value (IV) for understanding predictive power...TrueFalse['dataset']{'breaks_adj': {'type': 'list', 'default': None}, 'fig_height': {'type': 'int', 'default': 600}, 'fig_width': {'type': 'int', 'default': 500}}['tabular_data', 'visualization', 'categorical_data']['classification']
validmind.data_validation.WOEBinTableWOE Bin TableAssesses the Weight of Evidence (WoE) and Information Value (IV) of each feature to evaluate its predictive power...FalseTrue['dataset']{'breaks_adj': {'type': 'list', 'default': None}}['tabular_data', 'categorical_data']['classification']
validmind.data_validation.ZivotAndrewsArchZivot Andrews ArchEvaluates the order of integration and stationarity of time series data using the Zivot-Andrews unit root test....FalseTrue['dataset']{}['time_series_data', 'stationarity', 'unit_root_test']['regression']
validmind.data_validation.nlp.CommonWordsCommon WordsAssesses the most frequent non-stopwords in a text column for identifying prevalent language patterns....TrueFalse['dataset']{}['nlp', 'text_data', 'visualization', 'frequency_analysis']['text_classification', 'text_summarization']
validmind.data_validation.nlp.HashtagsHashtagsAssesses hashtag frequency in a text column, highlighting usage trends and potential dataset bias or spam....TrueFalse['dataset']{'top_hashtags': {'type': 'int', 'default': 25}}['nlp', 'text_data', 'visualization', 'frequency_analysis']['text_classification', 'text_summarization']
validmind.data_validation.nlp.LanguageDetectionLanguage DetectionAssesses the diversity of languages in a textual dataset by detecting and visualizing the distribution of languages....TrueFalse['dataset']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.data_validation.nlp.MentionsMentionsCalculates and visualizes frequencies of '@' prefixed mentions in a text-based dataset for NLP model analysis....TrueFalse['dataset']{'top_mentions': {'type': 'int', 'default': 25}}['nlp', 'text_data', 'visualization', 'frequency_analysis']['text_classification', 'text_summarization']
validmind.data_validation.nlp.PolarityAndSubjectivityPolarity And SubjectivityAnalyzes the polarity and subjectivity of text data within a given dataset to visualize the sentiment distribution....TrueTrue['dataset']{'threshold_subjectivity': {'type': '_empty', 'default': 0.5}, 'threshold_polarity': {'type': '_empty', 'default': 0}}['nlp', 'text_data', 'data_validation']['nlp']
validmind.data_validation.nlp.PunctuationsPunctuationsAnalyzes and visualizes the frequency distribution of punctuation usage in a given text dataset....TrueFalse['dataset']{'count_mode': {'type': '_empty', 'default': 'token'}}['nlp', 'text_data', 'visualization', 'frequency_analysis']['text_classification', 'text_summarization', 'nlp']
validmind.data_validation.nlp.SentimentSentimentAnalyzes the sentiment of text data within a dataset using the VADER sentiment analysis tool....TrueFalse['dataset']{}['nlp', 'text_data', 'data_validation']['nlp']
validmind.data_validation.nlp.StopWordsStop WordsEvaluates and visualizes the frequency of English stop words in a text dataset against a defined threshold....TrueTrue['dataset']{'min_percent_threshold': {'type': 'float', 'default': 0.5}, 'num_words': {'type': 'int', 'default': 25}}['nlp', 'text_data', 'frequency_analysis', 'visualization']['text_classification', 'text_summarization']
validmind.data_validation.nlp.TextDescriptionText DescriptionConducts comprehensive textual analysis on a dataset using NLTK to evaluate various parameters and generate...TrueFalse['dataset']{'unwanted_tokens': {'type': 'set', 'default': {'s', 'mrs', 'us', \"''\", ' ', 'ms', 'dr', 'dollar', '``', 'mr', \"'s\", \"s'\"}}, 'lang': {'type': 'str', 'default': 'english'}}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.data_validation.nlp.ToxicityToxicityAssesses the toxicity of text data within a dataset to visualize the distribution of toxicity scores....TrueFalse['dataset']{}['nlp', 'text_data', 'data_validation']['nlp']
validmind.model_validation.BertScoreBert ScoreAssesses the quality of machine-generated text using BERTScore metrics and visualizes results through histograms...TrueTrue['dataset', 'model']{'evaluation_model': {'type': '_empty', 'default': 'distilbert-base-uncased'}}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.BleuScoreBleu ScoreEvaluates the quality of machine-generated text using BLEU metrics and visualizes the results through histograms...TrueTrue['dataset', 'model']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.ClusterSizeDistributionCluster Size DistributionAssesses the performance of clustering models by comparing the distribution of cluster sizes in model predictions...TrueFalse['dataset', 'model']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.ContextualRecallContextual RecallEvaluates a Natural Language Generation model's ability to generate contextually relevant and factually correct...TrueTrue['dataset', 'model']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.FeaturesAUCFeatures AUCEvaluates the discriminatory power of each individual feature within a binary classification model by calculating...TrueFalse['dataset']{'fontsize': {'type': 'int', 'default': 12}, 'figure_height': {'type': 'int', 'default': 500}}['feature_importance', 'AUC', 'visualization']['classification']
validmind.model_validation.MeteorScoreMeteor ScoreAssesses the quality of machine-generated translations by comparing them to human-produced references using the...TrueTrue['dataset', 'model']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.ModelMetadataModel MetadataCompare metadata of different models and generate a summary table with the results....FalseTrue['model']{}['model_training', 'metadata']['regression', 'time_series_forecasting']
validmind.model_validation.ModelPredictionResidualsModel Prediction ResidualsAssesses normality and behavior of residuals in regression models through visualization and statistical tests....TrueTrue['dataset', 'model']{'nbins': {'type': 'int', 'default': 100}, 'p_value_threshold': {'type': 'float', 'default': 0.05}, 'start_date': {'type': None, 'default': None}, 'end_date': {'type': None, 'default': None}}['regression']['residual_analysis', 'visualization']
validmind.model_validation.RegardScoreRegard ScoreAssesses the sentiment and potential biases in text generated by NLP models by computing and visualizing regard...TrueTrue['dataset', 'model']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.RegressionResidualsPlotRegression Residuals PlotEvaluates regression model performance using residual distribution and actual vs. predicted plots....TrueFalse['model', 'dataset']{'bin_size': {'type': 'float', 'default': 0.1}}['model_performance', 'visualization']['regression']
validmind.model_validation.RougeScoreRouge ScoreAssesses the quality of machine-generated text using ROUGE metrics and visualizes the results to provide...TrueTrue['dataset', 'model']{'metric': {'type': 'str', 'default': 'rouge-1'}}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.TimeSeriesPredictionWithCITime Series Prediction With CIAssesses predictive accuracy and uncertainty in time series models, highlighting breaches beyond confidence...TrueTrue['dataset', 'model']{'confidence': {'type': 'float', 'default': 0.95}}['model_predictions', 'visualization']['regression', 'time_series_forecasting']
validmind.model_validation.TimeSeriesPredictionsPlotTime Series Predictions PlotPlot actual vs predicted values for time series data and generate a visual comparison for the model....TrueFalse['dataset', 'model']{}['model_predictions', 'visualization']['regression', 'time_series_forecasting']
validmind.model_validation.TimeSeriesR2SquareBySegmentsTime Series R2 Square By SegmentsEvaluates the R-Squared values of regression models over specified time segments in time series data to assess...TrueTrue['dataset', 'model']{'segments': {'type': None, 'default': None}}['model_performance', 'sklearn']['regression', 'time_series_forecasting']
validmind.model_validation.TokenDisparityToken DisparityEvaluates the token disparity between reference and generated texts, visualizing the results through histograms and...TrueTrue['dataset', 'model']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.ToxicityScoreToxicity ScoreAssesses the toxicity levels of texts generated by NLP models to identify and mitigate harmful or offensive content....TrueTrue['dataset', 'model']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.embeddings.ClusterDistributionCluster DistributionAssesses the distribution of text embeddings across clusters produced by a model using KMeans clustering....TrueFalse['model', 'dataset']{'num_clusters': {'type': 'int', 'default': 5}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.CosineSimilarityComparisonCosine Similarity ComparisonAssesses the similarity between embeddings generated by different models using Cosine Similarity, providing both...TrueTrue['dataset', 'models']{}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.CosineSimilarityDistributionCosine Similarity DistributionAssesses the similarity between predicted text embeddings from a model using a Cosine Similarity distribution...TrueFalse['dataset', 'model']{}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.CosineSimilarityHeatmapCosine Similarity HeatmapGenerates an interactive heatmap to visualize the cosine similarities among embeddings derived from a given model....TrueFalse['dataset', 'model']{'title': {'type': '_empty', 'default': 'Cosine Similarity Matrix'}, 'color': {'type': '_empty', 'default': 'Cosine Similarity'}, 'xaxis_title': {'type': '_empty', 'default': 'Index'}, 'yaxis_title': {'type': '_empty', 'default': 'Index'}, 'color_scale': {'type': '_empty', 'default': 'Blues'}}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.DescriptiveAnalyticsDescriptive AnalyticsEvaluates statistical properties of text embeddings in an ML model via mean, median, and standard deviation...TrueFalse['dataset', 'model']{}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.EmbeddingsVisualization2DEmbeddings Visualization2 DVisualizes 2D representation of text embeddings generated by a model using t-SNE technique....TrueFalse['dataset', 'model']{'cluster_column': {'type': None, 'default': None}, 'perplexity': {'type': 'int', 'default': 30}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.EuclideanDistanceComparisonEuclidean Distance ComparisonAssesses and visualizes the dissimilarity between model embeddings using Euclidean distance, providing insights...TrueTrue['dataset', 'models']{}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.EuclideanDistanceHeatmapEuclidean Distance HeatmapGenerates an interactive heatmap to visualize the Euclidean distances among embeddings derived from a given model....TrueFalse['dataset', 'model']{'title': {'type': '_empty', 'default': 'Euclidean Distance Matrix'}, 'color': {'type': '_empty', 'default': 'Euclidean Distance'}, 'xaxis_title': {'type': '_empty', 'default': 'Index'}, 'yaxis_title': {'type': '_empty', 'default': 'Index'}, 'color_scale': {'type': '_empty', 'default': 'Blues'}}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.PCAComponentsPairwisePlotsPCA Components Pairwise PlotsGenerates scatter plots for pairwise combinations of principal component analysis (PCA) components of model...TrueFalse['dataset', 'model']{'n_components': {'type': 'int', 'default': 3}}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.StabilityAnalysisKeywordStability Analysis KeywordEvaluates robustness of embedding models to keyword swaps in the test dataset....TrueTrue['dataset', 'model']{'keyword_dict': {'type': None, 'default': None}, 'mean_similarity_threshold': {'type': 'float', 'default': 0.7}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.StabilityAnalysisRandomNoiseStability Analysis Random NoiseAssesses the robustness of text embeddings models to random noise introduced via text perturbations....TrueTrue['dataset', 'model']{'probability': {'type': 'float', 'default': 0.02}, 'mean_similarity_threshold': {'type': 'float', 'default': 0.7}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.StabilityAnalysisSynonymsStability Analysis SynonymsEvaluates the stability of text embeddings models when words in test data are replaced by their synonyms randomly....TrueTrue['dataset', 'model']{'probability': {'type': 'float', 'default': 0.02}, 'mean_similarity_threshold': {'type': 'float', 'default': 0.7}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.StabilityAnalysisTranslationStability Analysis TranslationEvaluates robustness of text embeddings models to noise introduced by translating the original text to another...TrueTrue['dataset', 'model']{'source_lang': {'type': 'str', 'default': 'en'}, 'target_lang': {'type': 'str', 'default': 'fr'}, 'mean_similarity_threshold': {'type': 'float', 'default': 0.7}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.TSNEComponentsPairwisePlotsTSNE Components Pairwise PlotsCreates scatter plots for pairwise combinations of t-SNE components to visualize embeddings and highlight potential...TrueFalse['dataset', 'model']{'n_components': {'type': 'int', 'default': 2}, 'perplexity': {'type': 'int', 'default': 30}, 'title': {'type': 'str', 'default': 't-SNE'}}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.AnswerCorrectnessAnswer CorrectnessEvaluates the correctness of answers in a dataset with respect to the provided ground...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'response_column': {'type': 'str', 'default': 'response'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.AspectCriticAspect CriticEvaluates generations against the following aspects: harmfulness, maliciousness,...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'response_column': {'type': 'str', 'default': 'response'}, 'retrieved_contexts_column': {'type': None, 'default': None}, 'aspects': {'type': None, 'default': ['coherence', 'conciseness', 'correctness', 'harmfulness', 'maliciousness']}, 'additional_aspects': {'type': None, 'default': None}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'qualitative']['text_summarization', 'text_generation', 'text_qa']
validmind.model_validation.ragas.ContextEntityRecallContext Entity RecallEvaluates the context entity recall for dataset entries and visualizes the results....TrueTrue['dataset']{'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'retrieval_performance']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.ContextPrecisionContext PrecisionContext Precision is a metric that evaluates whether all of the ground-truth...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'retrieval_performance']['text_qa', 'text_generation', 'text_summarization', 'text_classification']
validmind.model_validation.ragas.ContextPrecisionWithoutReferenceContext Precision Without ReferenceContext Precision Without Reference is a metric used to evaluate the relevance of...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'response_column': {'type': 'str', 'default': 'response'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'retrieval_performance']['text_qa', 'text_generation', 'text_summarization', 'text_classification']
validmind.model_validation.ragas.ContextRecallContext RecallContext recall measures the extent to which the retrieved context aligns with the...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'retrieval_performance']['text_qa', 'text_generation', 'text_summarization', 'text_classification']
validmind.model_validation.ragas.FaithfulnessFaithfulnessEvaluates the faithfulness of the generated answers with respect to retrieved contexts....TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'response_column': {'type': 'str', 'default': 'response'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'rag_performance']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.NoiseSensitivityNoise SensitivityAssesses the sensitivity of a Large Language Model (LLM) to noise in retrieved context by measuring how often it...TrueTrue['dataset']{'response_column': {'type': 'str', 'default': 'response'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'focus': {'type': 'str', 'default': 'relevant'}, 'user_input_column': {'type': 'str', 'default': 'user_input'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'rag_performance']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.ResponseRelevancyResponse RelevancyAssesses how pertinent the generated answer is to the given prompt....TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'retrieved_contexts_column': {'type': 'str', 'default': None}, 'response_column': {'type': 'str', 'default': 'response'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'rag_performance']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.SemanticSimilaritySemantic SimilarityCalculates the semantic similarity between generated responses and ground truths...TrueTrue['dataset']{'response_column': {'type': 'str', 'default': 'response'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.sklearn.AdjustedMutualInformationAdjusted Mutual InformationEvaluates clustering model performance by measuring mutual information between true and predicted labels, adjusting...FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.AdjustedRandIndexAdjusted Rand IndexMeasures the similarity between two data clusters using the Adjusted Rand Index (ARI) metric in clustering machine...FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.CalibrationCurveCalibration CurveEvaluates the calibration of probability estimates by comparing predicted probabilities against observed...TrueFalse['model', 'dataset']{'n_bins': {'type': 'int', 'default': 10}}['sklearn', 'model_performance', 'classification']['classification']
validmind.model_validation.sklearn.ClassifierPerformanceClassifier PerformanceEvaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,...FalseTrue['dataset', 'model']{'average': {'type': 'str', 'default': 'macro'}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.ClassifierThresholdOptimizationClassifier Threshold OptimizationAnalyzes and visualizes different threshold optimization methods for binary classification models....FalseTrue['dataset', 'model']{'methods': {'type': None, 'default': None}, 'target_recall': {'type': None, 'default': None}}['model_validation', 'threshold_optimization', 'classification_metrics']['classification']
validmind.model_validation.sklearn.ClusterCosineSimilarityCluster Cosine SimilarityMeasures the intra-cluster similarity of a clustering model using cosine similarity....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.ClusterPerformanceMetricsCluster Performance MetricsEvaluates the performance of clustering machine learning models using multiple established metrics....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.CompletenessScoreCompleteness ScoreEvaluates a clustering model's capacity to categorize instances from a single class into the same cluster....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.ConfusionMatrixConfusion MatrixEvaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...TrueFalse['dataset', 'model']{'threshold': {'type': 'float', 'default': 0.5}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.FeatureImportanceFeature ImportanceCompute feature importance scores for a given model and generate a summary table...FalseTrue['dataset', 'model']{'num_features': {'type': 'int', 'default': 3}}['model_explainability', 'sklearn']['regression', 'time_series_forecasting']
validmind.model_validation.sklearn.FowlkesMallowsScoreFowlkes Mallows ScoreEvaluates the similarity between predicted and actual cluster assignments in a model using the Fowlkes-Mallows...FalseTrue['dataset', 'model']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.sklearn.HomogeneityScoreHomogeneity ScoreAssesses clustering homogeneity by comparing true and predicted labels, scoring from 0 (heterogeneous) to 1...FalseTrue['dataset', 'model']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.sklearn.HyperParametersTuningHyper Parameters TuningPerforms exhaustive grid search over specified parameter ranges to find optimal model configurations...FalseTrue['model', 'dataset']{'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}, 'fit_params': {'type': 'dict', 'default': None}}['sklearn', 'model_performance']['clustering', 'classification']
validmind.model_validation.sklearn.KMeansClustersOptimizationK Means Clusters OptimizationOptimizes the number of clusters in K-means models using Elbow and Silhouette methods....TrueFalse['model', 'dataset']{'n_clusters': {'type': None, 'default': None}}['sklearn', 'model_performance', 'kmeans']['clustering']
validmind.model_validation.sklearn.MinimumAccuracyMinimum AccuracyChecks if the model's prediction accuracy meets or surpasses a specified threshold....FalseTrue['dataset', 'model']{'min_threshold': {'type': 'float', 'default': 0.7}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1ScoreMinimum F1 ScoreAssesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced...FalseTrue['dataset', 'model']{'min_threshold': {'type': 'float', 'default': 0.5}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScoreMinimum ROCAUC ScoreValidates model by checking if the ROC AUC score meets or surpasses a specified threshold....FalseTrue['dataset', 'model']{'min_threshold': {'type': 'float', 'default': 0.5}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.ModelParametersModel ParametersExtracts and displays model parameters in a structured format for transparency and reproducibility....FalseTrue['model']{'model_params': {'type': None, 'default': None}}['model_training', 'metadata']['classification', 'regression']
validmind.model_validation.sklearn.ModelsPerformanceComparisonModels Performance ComparisonEvaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,...FalseTrue['dataset', 'models']{}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison']['classification', 'text_classification']
validmind.model_validation.sklearn.OverfitDiagnosisOverfit DiagnosisAssesses potential overfitting in a model's predictions, identifying regions where performance between training and...TrueTrue['model', 'datasets']{'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}}['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis']['classification', 'regression']
validmind.model_validation.sklearn.PermutationFeatureImportancePermutation Feature ImportanceAssesses the significance of each feature in a model by evaluating the impact on model performance when feature...TrueFalse['model', 'dataset']{'fontsize': {'type': None, 'default': None}, 'figure_height': {'type': None, 'default': None}}['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndexPopulation Stability IndexAssesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across...TrueTrue['datasets', 'model']{'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurvePrecision Recall CurveEvaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve....TrueFalse['model', 'dataset']{}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurveROC CurveEvaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic...TrueFalse['model', 'dataset']{}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrorsRegression ErrorsAssesses the performance and error distribution of a regression model using various error metrics....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance']['regression', 'classification']
validmind.model_validation.sklearn.RegressionErrorsComparisonRegression Errors ComparisonAssesses multiple regression error metrics to compare model performance across different datasets, emphasizing...FalseTrue['datasets', 'models']{}['model_performance', 'sklearn']['regression', 'time_series_forecasting']
validmind.model_validation.sklearn.RegressionPerformanceRegression PerformanceEvaluates the performance of a regression model using five different metrics: MAE, MSE, RMSE, MAPE, and MBD....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance']['regression']
validmind.model_validation.sklearn.RegressionR2SquareRegression R2 SquareAssesses the overall goodness-of-fit of a regression model by evaluating R-squared (R2) and Adjusted R-squared (Adj...FalseTrue['dataset', 'model']{}['sklearn', 'model_performance']['regression']
validmind.model_validation.sklearn.RegressionR2SquareComparisonRegression R2 Square ComparisonCompares R-Squared and Adjusted R-Squared values for different regression models across multiple datasets to assess...FalseTrue['datasets', 'models']{}['model_performance', 'sklearn']['regression', 'time_series_forecasting']
validmind.model_validation.sklearn.RobustnessDiagnosisRobustness DiagnosisAssesses the robustness of a machine learning model by evaluating performance decay under noisy conditions....TrueTrue['datasets', 'model']{'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': None, 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}}['sklearn', 'model_diagnosis', 'visualization']['classification', 'regression']
validmind.model_validation.sklearn.SHAPGlobalImportanceSHAP Global ImportanceEvaluates and visualizes global feature importance using SHAP values for model explanation and risk identification....FalseTrue['model', 'dataset']{'kernel_explainer_samples': {'type': 'int', 'default': 10}, 'tree_or_linear_explainer_samples': {'type': 'int', 'default': 200}, 'class_of_interest': {'type': None, 'default': None}}['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.ScoreProbabilityAlignmentScore Probability AlignmentAnalyzes the alignment between credit scores and predicted probabilities....TrueTrue['model', 'dataset']{'score_column': {'type': 'str', 'default': 'score'}, 'n_bins': {'type': 'int', 'default': 10}}['visualization', 'credit_risk', 'calibration']['classification']
validmind.model_validation.sklearn.SilhouettePlotSilhouette PlotCalculates and visualizes Silhouette Score, assessing the degree of data point suitability to its cluster in ML...TrueTrue['model', 'dataset']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.sklearn.TrainingTestDegradationTraining Test DegradationTests if model performance degradation between training and test datasets exceeds a predefined threshold....FalseTrue['datasets', 'model']{'max_threshold': {'type': 'float', 'default': 0.1}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.VMeasureV MeasureEvaluates homogeneity and completeness of a clustering model using the V Measure Score....FalseTrue['dataset', 'model']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.sklearn.WeakspotsDiagnosisWeakspots DiagnosisIdentifies and visualizes weak spots in a machine learning model's performance across various sections of the...TrueTrue['datasets', 'model']{'features_columns': {'type': None, 'default': None}, 'metrics': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization']['classification', 'text_classification']
validmind.model_validation.statsmodels.AutoARIMAAuto ARIMAEvaluates ARIMA models for time-series forecasting, ranking them using Bayesian and Akaike Information Criteria....FalseTrue['model', 'dataset']{}['time_series_data', 'forecasting', 'model_selection', 'statsmodels']['regression']
validmind.model_validation.statsmodels.CumulativePredictionProbabilitiesCumulative Prediction ProbabilitiesVisualizes cumulative probabilities of positive and negative classes for both training and testing in classification models....TrueFalse['dataset', 'model']{'title': {'type': 'str', 'default': 'Cumulative Probabilities'}}['visualization', 'credit_risk']['classification']
validmind.model_validation.statsmodels.DurbinWatsonTestDurbin Watson TestAssesses autocorrelation in time series data features using the Durbin-Watson statistic....FalseTrue['dataset', 'model']{'threshold': {'type': None, 'default': [1.5, 2.5]}}['time_series_data', 'forecasting', 'statistical_test', 'statsmodels']['regression']
validmind.model_validation.statsmodels.GINITableGINI TableEvaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets....FalseTrue['dataset', 'model']{}['model_performance']['classification']
validmind.model_validation.statsmodels.KolmogorovSmirnovKolmogorov SmirnovAssesses whether each feature in the dataset aligns with a normal distribution using the Kolmogorov-Smirnov test....FalseTrue['model', 'dataset']{'dist': {'type': 'str', 'default': 'norm'}}['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels']['classification', 'regression']
validmind.model_validation.statsmodels.LillieforsLillieforsAssesses the normality of feature distributions in an ML model's training dataset using the Lilliefors test....FalseTrue['dataset']{}['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels']['classification', 'regression']
validmind.model_validation.statsmodels.PredictionProbabilitiesHistogramPrediction Probabilities HistogramAssesses the predictive probability distribution for binary classification to evaluate model performance and...TrueFalse['dataset', 'model']{'title': {'type': 'str', 'default': 'Histogram of Predictive Probabilities'}}['visualization', 'credit_risk']['classification']
validmind.model_validation.statsmodels.RegressionCoeffsRegression CoeffsAssesses the significance and uncertainty of predictor variables in a regression model through visualization of...TrueTrue['model']{}['tabular_data', 'visualization', 'model_training']['regression']
validmind.model_validation.statsmodels.RegressionFeatureSignificanceRegression Feature SignificanceAssesses and visualizes the statistical significance of features in a regression model....TrueFalse['model']{'fontsize': {'type': 'int', 'default': 10}, 'p_threshold': {'type': 'float', 'default': 0.05}}['statistical_test', 'model_interpretation', 'visualization', 'feature_importance']['regression']
validmind.model_validation.statsmodels.RegressionModelForecastPlotRegression Model Forecast PlotGenerates plots to visually compare the forecasted outcomes of a regression model against actual observed values over...TrueFalse['model', 'dataset']{'start_date': {'type': None, 'default': None}, 'end_date': {'type': None, 'default': None}}['time_series_data', 'forecasting', 'visualization']['regression']
validmind.model_validation.statsmodels.RegressionModelForecastPlotLevelsRegression Model Forecast Plot LevelsAssesses the alignment between forecasted and observed values in regression models through visual plots...TrueFalse['model', 'dataset']{}['time_series_data', 'forecasting', 'visualization']['regression']
validmind.model_validation.statsmodels.RegressionModelSensitivityPlotRegression Model Sensitivity PlotAssesses the sensitivity of a regression model to changes in independent variables by applying shocks and...TrueFalse['dataset', 'model']{'shocks': {'type': None, 'default': [0.1]}, 'transformation': {'type': None, 'default': None}}['senstivity_analysis', 'visualization']['regression']
validmind.model_validation.statsmodels.RegressionModelSummaryRegression Model SummaryEvaluates regression model performance using metrics including R-Squared, Adjusted R-Squared, MSE, and RMSE....FalseTrue['dataset', 'model']{}['model_performance', 'regression']['regression']
validmind.model_validation.statsmodels.RegressionPermutationFeatureImportanceRegression Permutation Feature ImportanceAssesses the significance of each feature in a model by evaluating the impact on model performance when feature...TrueFalse['dataset', 'model']{'fontsize': {'type': 'int', 'default': 12}, 'figure_height': {'type': 'int', 'default': 500}}['statsmodels', 'feature_importance', 'visualization']['regression']
validmind.model_validation.statsmodels.ScorecardHistogramScorecard HistogramThe Scorecard Histogram test evaluates the distribution of credit scores between default and non-default instances,...TrueFalse['dataset']{'title': {'type': 'str', 'default': 'Histogram of Scores'}, 'score_column': {'type': 'str', 'default': 'score'}}['visualization', 'credit_risk', 'logistic_regression']['classification']
validmind.ongoing_monitoring.CalibrationCurveDriftCalibration Curve DriftEvaluates changes in probability calibration between reference and monitoring datasets....TrueTrue['datasets', 'model']{'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDriftClass Discrimination DriftCompares classification discrimination metrics between reference and monitoring datasets....FalseTrue['datasets', 'model']{'drift_pct_threshold': {'type': '_empty', 'default': 20}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.ongoing_monitoring.ClassImbalanceDriftClass Imbalance DriftEvaluates drift in class distribution between reference and monitoring datasets....TrueTrue['datasets']{'drift_pct_threshold': {'type': 'float', 'default': 5.0}, 'title': {'type': 'str', 'default': 'Class Distribution Drift'}}['tabular_data', 'binary_classification', 'multiclass_classification']['classification']
validmind.ongoing_monitoring.ClassificationAccuracyDriftClassification Accuracy DriftCompares classification accuracy metrics between reference and monitoring datasets....FalseTrue['datasets', 'model']{'drift_pct_threshold': {'type': '_empty', 'default': 20}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDriftConfusion Matrix DriftCompares confusion matrix metrics between reference and monitoring datasets....FalseTrue['datasets', 'model']{'drift_pct_threshold': {'type': '_empty', 'default': 20}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.ongoing_monitoring.CumulativePredictionProbabilitiesDrift | Cumulative Prediction Probabilities Drift | Compares cumulative prediction probability distributions between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['visualization', 'credit_risk'] | ['classification']
validmind.ongoing_monitoring.FeatureDrift | Feature Drift | Evaluates changes in feature distribution over time to identify potential model drift.... | True | True | ['datasets'] | {'bins': {'type': '_empty', 'default': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}, 'feature_columns': {'type': '_empty', 'default': None}, 'psi_threshold': {'type': '_empty', 'default': 0.2}} | ['visualization'] | ['monitoring']
validmind.ongoing_monitoring.PredictionAcrossEachFeature | Prediction Across Each Feature | Assesses differences in model predictions across individual features between reference and monitoring datasets... | True | False | ['datasets', 'model'] | {} | ['visualization'] | ['monitoring']
validmind.ongoing_monitoring.PredictionCorrelation | Prediction Correlation | Assesses correlation changes between model predictions from reference and monitoring datasets to detect potential... | True | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': 'float', 'default': 20}} | ['visualization'] | ['monitoring']
validmind.ongoing_monitoring.PredictionProbabilitiesHistogramDrift | Prediction Probabilities Histogram Drift | Compares prediction probability distributions between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'title': {'type': '_empty', 'default': 'Prediction Probabilities Histogram Drift'}, 'drift_pct_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk'] | ['classification']
validmind.ongoing_monitoring.PredictionQuantilesAcrossFeatures | Prediction Quantiles Across Features | Assesses differences in model prediction distributions across individual features between reference... | True | False | ['datasets', 'model'] | {} | ['visualization'] | ['monitoring']
validmind.ongoing_monitoring.ROCCurveDrift | ROC Curve Drift | Compares ROC curves between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification']
validmind.ongoing_monitoring.ScoreBandsDrift | Score Bands Drift | Analyzes drift in population distribution and default rates across score bands.... | False | True | ['datasets', 'model'] | {'score_column': {'type': 'str', 'default': 'score'}, 'score_bands': {'type': 'list', 'default': None}, 'drift_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk', 'scorecard'] | ['classification']
validmind.ongoing_monitoring.ScorecardHistogramDrift | Scorecard Histogram Drift | Compares score distributions between reference and monitoring datasets for each class.... | True | True | ['datasets'] | {'score_column': {'type': 'str', 'default': 'score'}, 'title': {'type': 'str', 'default': 'Scorecard Histogram Drift'}, 'drift_pct_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk', 'logistic_regression'] | ['classification']
validmind.ongoing_monitoring.TargetPredictionDistributionPlot | Target Prediction Distribution Plot | Assesses differences in prediction distributions between a reference dataset and a monitoring dataset to identify... | True | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': 'float', 'default': 20}} | ['visualization'] | ['monitoring']
validmind.prompt_validation.Bias | Bias | Assesses potential bias in a Large Language Model by analyzing the distribution and order of exemplars in the... | False | True | ['model'] | {'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}} | ['llm', 'few_shot'] | ['text_classification', 'text_summarization']
validmind.prompt_validation.Clarity | Clarity | Evaluates and scores the clarity of prompts in a Large Language Model based on specified guidelines.... | False | True | ['model'] | {'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}} | ['llm', 'zero_shot', 'few_shot'] | ['text_classification', 'text_summarization']
validmind.prompt_validation.Conciseness | Conciseness | Analyzes and grades the conciseness of prompts provided to a Large Language Model.... | False | True | ['model'] | {'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}} | ['llm', 'zero_shot', 'few_shot'] | ['text_classification', 'text_summarization']
validmind.prompt_validation.Delimitation | Delimitation | Evaluates the proper use of delimiters in prompts provided to Large Language Models.... | False | True | ['model'] | {'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}} | ['llm', 'zero_shot', 'few_shot'] | ['text_classification', 'text_summarization']
validmind.prompt_validation.NegativeInstruction | Negative Instruction | Evaluates and grades the use of affirmative, proactive language over negative instructions in LLM prompts.... | False | True | ['model'] | {'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}} | ['llm', 'zero_shot', 'few_shot'] | ['text_classification', 'text_summarization']
validmind.prompt_validation.Robustness | Robustness | Assesses the robustness of prompts provided to a Large Language Model under varying conditions and contexts. This test... | False | True | ['model', 'dataset'] | {'num_tests': {'type': '_empty', 'default': 10}, 'judge_llm': {'type': '_empty', 'default': None}} | ['llm', 'zero_shot', 'few_shot'] | ['text_classification', 'text_summarization']
validmind.prompt_validation.Specificity | Specificity | Evaluates and scores the specificity of prompts provided to a Large Language Model (LLM), based on clarity, detail,... | False | True | ['model'] | {'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}} | ['llm', 'zero_shot', 'few_shot'] | ['text_classification', 'text_summarization']
validmind.unit_metrics.classification.Accuracy | Accuracy | Calculates the accuracy of a model | False | False | ['dataset', 'model'] | {} | ['classification'] | ['classification']
validmind.unit_metrics.classification.F1 | F1 | Calculates the F1 score for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification']
validmind.unit_metrics.classification.Precision | Precision | Calculates the precision for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification']
validmind.unit_metrics.classification.ROC_AUC | ROC AUC | Calculates the ROC AUC for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification']
validmind.unit_metrics.classification.Recall | Recall | Calculates the recall for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification']
validmind.unit_metrics.regression.AdjustedRSquaredScore | Adjusted R Squared Score | Calculates the adjusted R-squared score for a regression model. | False | False | ['model', 'dataset'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.GiniCoefficient | Gini Coefficient | Calculates the Gini coefficient for a regression model. | False | False | ['dataset', 'model'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.HuberLoss | Huber Loss | Calculates the Huber loss for a regression model. | False | False | ['model', 'dataset'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.KolmogorovSmirnovStatistic | Kolmogorov Smirnov Statistic | Calculates the Kolmogorov-Smirnov statistic for a regression model. | False | False | ['dataset', 'model'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.MeanAbsoluteError | Mean Absolute Error | Calculates the mean absolute error for a regression model. | False | False | ['model', 'dataset'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.MeanAbsolutePercentageError | Mean Absolute Percentage Error | Calculates the mean absolute percentage error for a regression model. | False | False | ['model', 'dataset'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.MeanBiasDeviation | Mean Bias Deviation | Calculates the mean bias deviation for a regression model. | False | False | ['model', 'dataset'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.MeanSquaredError | Mean Squared Error | Calculates the mean squared error for a regression model. | False | False | ['model', 'dataset'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.QuantileLoss | Quantile Loss | Calculates the quantile loss for a regression model. | False | False | ['model', 'dataset'] | {'quantile': {'type': '_empty', 'default': 0.5}} | ['regression'] | ['regression']
validmind.unit_metrics.regression.RSquaredScore | R Squared Score | Calculates the R-squared score for a regression model. | False | False | ['model', 'dataset'] | {} | ['regression'] | ['regression']
validmind.unit_metrics.regression.RootMeanSquaredError | Root Mean Squared Error | Calculates the root mean squared error for a regression model. | False | False | ['model', 'dataset'] | {} | ['regression'] | ['regression']
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "list_tests()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Understand tags and task types\n", - "\n", - "Use [list_tasks()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tasks) to view all unique task types used to classify tests in the ValidMind Library.\n", - "\n", - "Understanding `task` types helps you filter tests that match your model’s objective. For example:\n", - "\n", - "- **classification:** Works with Classification Models and Datasets.\n", - "- **regression:** Works with Regression Models and Datasets.\n", - "- **text classification:** Works with Text Classification Models and Datasets.\n", - "- **text summarization:** Works with Text Summarization Models and Datasets." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['text_qa',\n", - " 'classification',\n", - " 'data_validation',\n", - " 'text_classification',\n", - " 'feature_extraction',\n", - " 'regression',\n", - " 'visualization',\n", - " 'clustering',\n", - " 'time_series_forecasting',\n", - " 'text_summarization',\n", - " 'nlp',\n", - " 'residual_analysis',\n", - " 'monitoring',\n", - " 'text_generation']" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "list_tasks()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use [list_tags()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tags) to view all unique tags used to describe tests in the ValidMind Library.\n", - "\n", - "`Tags` describe what a test applies to and help you filter tests for your use case. Examples include:\n", - "\n", - "- **llm:** Tests that work with Large Language Models.\n", - "- **nlp:** Tests relevant for natural language processing.\n", - "- **binary_classification:** Tests for binary classification tasks.\n", - "- **forecasting:** Tests for forecasting and time-series analysis.\n", - "- **tabular_data:** Tests for tabular data like CSVs and Excel spreadsheets." - ] - }, + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Explore tests\n", + "\n", + "Explore the individual out-the-box tests available in the ValidMind Library, and identify which tests to run to evaluate different aspects of your model. Browse available tests, view their descriptions, and filter by tags or task type to find tests relevant to your use case." 
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "::: {.content-hidden when-format=\"html\"}\n",
+    "## Contents \n",
+    "- [About ValidMind](#toc1__) \n",
+    "  - [Before you begin](#toc1_1__) \n",
+    "  - [New to ValidMind?](#toc1_2__) \n",
+    "  - [Key concepts](#toc1_3__) \n",
+    "- [Install the ValidMind Library](#toc2__) \n",
+    "- [List all available tests](#toc3__) \n",
+    "- [Understand tags and task types](#toc4__) \n",
+    "- [Filter tests by tags and task types](#toc5__) \n",
+    "- [Store test sets for use](#toc6__) \n",
+    "- [Next steps](#toc7__) \n",
+    "  - [Discover more learning resources](#toc7_1__) \n",
+    "- [Upgrade ValidMind](#toc8__) \n",
+    "\n",
+    ":::\n",
+    "\n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "## About ValidMind\n",
+    "\n",
+    "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.\n",
+    "\n",
+    "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.\n",
+    "\n",
+    "\n",
+    "\n",
+    "### Before you begin\n",
+    "\n",
+    "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook, but we recommend further familiarizing yourself with the language.\n",
+    "\n",
+    "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).\n",
+    "\n",
+    "\n",
+    "\n",
+    "### New to ValidMind?\n",
+    "\n",
+    "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n",
+    "\n",
+    "For access to all features available in this notebook, you'll need access to a ValidMind account.\n",
+    "\n",
+    "Register with ValidMind\n",
\n", + "\n", + "\n", + "\n", + "### Key concepts\n", + "\n", + "**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.\n", + "\n", + "**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.\n", + "\n", + "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", + "\n", + "**Custom tests**: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.\n", + "\n", + "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", + "\n", + " - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", + " - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", + " - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.\n", + " - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this [example](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html) for more information.\n", + "\n", + "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.\n", + "\n", + "**Outputs**: Custom tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.\n", + "\n", + "**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.\n", + "\n", + "Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Install the ValidMind Library\n", + "\n", + "
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "\n",
+    "## Install the ValidMind Library\n",
+    "\n",
+    "**Recommended Python versions**\n",
+    "\n",
+    "Python 3.8 <= x <= 3.11\n",
\n", + "\n", + "To install the library:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q validmind" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## List all available tests" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Start by importing the functions from the [validmind.tests](https://docs.validmind.ai/validmind/validmind/tests.html) module for listing tests, listing tasks, listing tags, and listing tasks and tags to access these functions in the rest of this notebook:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from validmind.tests import (\n", + " list_tests,\n", + " list_tasks,\n", + " list_tags,\n", + " list_tasks_and_tags,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use [list_tests()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tests) to retrieve all available ValidMind tests, which returns a DataFrame with the following columns:\n", + "\n", + "- **ID** – A unique identifier for each test.\n", + "- **Name** – The test’s name.\n", + "- **Description** – A short summary of what the test evaluates.\n", + "- **Tags** – Keywords that describe what the test does or applies to.\n", + "- **Tasks** – The type of modeling task the test supports." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['senstivity_analysis',\n", - " 'calibration',\n", - " 'clustering',\n", - " 'anomaly_detection',\n", - " 'nlp',\n", - " 'classification_metrics',\n", - " 'dimensionality_reduction',\n", - " 'tabular_data',\n", - " 'time_series_data',\n", - " 'model_predictions',\n", - " 'feature_selection',\n", - " 'correlation',\n", - " 'frequency_analysis',\n", - " 'embeddings',\n", - " 'regression',\n", - " 'llm',\n", - " 'statsmodels',\n", - " 'ragas',\n", - " 'model_performance',\n", - " 'model_validation',\n", - " 'rag_performance',\n", - " 'model_training',\n", - " 'qualitative',\n", - " 'classification',\n", - " 'kmeans',\n", - " 'multiclass_classification',\n", - " 'linear_regression',\n", - " 'data_quality',\n", - " 'text_data',\n", - " 'binary_classification',\n", - " 'threshold_optimization',\n", - " 'stationarity',\n", - " 'bias_and_fairness',\n", - " 'scorecard',\n", - " 'model_explainability',\n", - " 'model_comparison',\n", - " 'numerical_data',\n", - " 'sklearn',\n", - " 'model_selection',\n", - " 'retrieval_performance',\n", - " 'zero_shot',\n", - " 'statistical_test',\n", - " 'descriptive_statistics',\n", - " 'seasonality',\n", - " 'analysis',\n", - " 'data_validation',\n", - " 'data_distribution',\n", - " 'feature_importance',\n", - " 'metadata',\n", - " 'few_shot',\n", - " 'visualization',\n", - " 'credit_risk',\n", - " 'forecasting',\n", - " 'AUC',\n", - " 'logistic_regression',\n", - " 'model_diagnosis',\n", - " 'model_interpretation',\n", - " 'unit_root_test',\n", - " 'categorical_data',\n", - " 'data_analysis']" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", 
+ " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", 
+ " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", 
+ " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", 
+ " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", 
+ " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", 
+ " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ID | Name | Description | Has Figure | Has Table | Required Inputs | Params | Tags | Tasks
validmind.data_validation.ACFandPACFPlot | AC Fand PACF Plot | Analyzes time series data using Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to... | True | False | ['dataset'] | {} | ['time_series_data', 'forecasting', 'statistical_test', 'visualization'] | ['regression']
validmind.data_validation.ADF | ADF | Assesses the stationarity of a time series dataset using the Augmented Dickey-Fuller (ADF) test.... | False | True | ['dataset'] | {} | ['time_series_data', 'statsmodels', 'forecasting', 'statistical_test', 'stationarity'] | ['regression']
validmind.data_validation.AutoAR | Auto AR | Automatically identifies the optimal Autoregressive (AR) order for a time series using BIC and AIC criteria.... | False | True | ['dataset'] | {'max_ar_order': {'type': 'int', 'default': 3}} | ['time_series_data', 'statsmodels', 'forecasting', 'statistical_test'] | ['regression']
validmind.data_validation.AutoMA | Auto MA | Automatically selects the optimal Moving Average (MA) order for each variable in a time series dataset based on... | False | True | ['dataset'] | {'max_ma_order': {'type': 'int', 'default': 3}} | ['time_series_data', 'statsmodels', 'forecasting', 'statistical_test'] | ['regression']
validmind.data_validation.AutoStationarity | Auto Stationarity | Automates Augmented Dickey-Fuller test to assess stationarity across multiple time series in a DataFrame.... | False | True | ['dataset'] | {'max_order': {'type': 'int', 'default': 5}, 'threshold': {'type': 'float', 'default': 0.05}} | ['time_series_data', 'statsmodels', 'forecasting', 'statistical_test'] | ['regression']
validmind.data_validation.BivariateScatterPlots | Bivariate Scatter Plots | Generates bivariate scatterplots to visually inspect relationships between pairs of numerical predictor variables... | True | False | ['dataset'] | {} | ['tabular_data', 'numerical_data', 'visualization'] | ['classification']
validmind.data_validation.BoxPierce | Box Pierce | Detects autocorrelation in time-series data through the Box-Pierce test to validate model performance.... | False | True | ['dataset'] | {} | ['time_series_data', 'forecasting', 'statistical_test', 'statsmodels'] | ['regression']
validmind.data_validation.ChiSquaredFeaturesTable | Chi Squared Features Table | Assesses the statistical association between categorical features and a target variable using the Chi-Squared test.... | False | True | ['dataset'] | {'p_threshold': {'type': '_empty', 'default': 0.05}} | ['tabular_data', 'categorical_data', 'statistical_test'] | ['classification']
validmind.data_validation.ClassImbalance | Class Imbalance | Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model.... | True | True | ['dataset'] | {'min_percent_threshold': {'type': 'int', 'default': 10}} | ['tabular_data', 'binary_classification', 'multiclass_classification', 'data_quality'] | ['classification']
validmind.data_validation.DatasetDescription | Dataset Description | Provides comprehensive analysis and statistical summaries of each column in a machine learning model's dataset.... | False | True | ['dataset'] | {} | ['tabular_data', 'time_series_data', 'text_data'] | ['classification', 'regression', 'text_classification', 'text_summarization']
validmind.data_validation.DatasetSplit | Dataset Split | Evaluates and visualizes the distribution proportions among training, testing, and validation datasets of an ML... | False | True | ['datasets'] | {} | ['tabular_data', 'time_series_data', 'text_data'] | ['classification', 'regression', 'text_classification', 'text_summarization']
validmind.data_validation.DescriptiveStatistics | Descriptive Statistics | Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's... | False | True | ['dataset'] | {} | ['tabular_data', 'time_series_data', 'data_quality'] | ['classification', 'regression']
validmind.data_validation.DickeyFullerGLS | Dickey Fuller GLS | Assesses stationarity in time series data using the Dickey-Fuller GLS test to determine the order of integration.... | False | True | ['dataset'] | {} | ['time_series_data', 'forecasting', 'unit_root_test'] | ['regression']
validmind.data_validation.Duplicates | Duplicates | Tests dataset for duplicate entries, ensuring model reliability via data quality verification.... | False | True | ['dataset'] | {'min_threshold': {'type': '_empty', 'default': 1}} | ['tabular_data', 'data_quality', 'text_data'] | ['classification', 'regression']
validmind.data_validation.EngleGrangerCoint | Engle Granger Coint | Assesses the degree of co-movement between pairs of time series data using the Engle-Granger cointegration test.... | False | True | ['dataset'] | {'threshold': {'type': 'float', 'default': 0.05}} | ['time_series_data', 'statistical_test', 'forecasting'] | ['regression']
validmind.data_validation.FeatureTargetCorrelationPlot | Feature Target Correlation Plot | Visualizes the correlation between input features and the model's target output in a color-coded horizontal bar... | True | False | ['dataset'] | {'fig_height': {'type': '_empty', 'default': 600}} | ['tabular_data', 'visualization', 'correlation'] | ['classification', 'regression']
validmind.data_validation.HighCardinality | High Cardinality | Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting.... | False | True | ['dataset'] | {'num_threshold': {'type': 'int', 'default': 100}, 'percent_threshold': {'type': 'float', 'default': 0.1}, 'threshold_type': {'type': 'str', 'default': 'percent'}} | ['tabular_data', 'data_quality', 'categorical_data'] | ['classification', 'regression']
validmind.data_validation.HighPearsonCorrelation | High Pearson Correlation | Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity.... | False | True | ['dataset'] | {'max_threshold': {'type': 'float', 'default': 0.3}, 'top_n_correlations': {'type': 'int', 'default': 10}, 'feature_columns': {'type': 'list', 'default': None}} | ['tabular_data', 'data_quality', 'correlation'] | ['classification', 'regression']
validmind.data_validation.IQROutliersBarPlot | IQR Outliers Bar Plot | Visualizes outlier distribution across percentiles in numerical data using the Interquartile Range (IQR) method.... | True | False | ['dataset'] | {'threshold': {'type': 'float', 'default': 1.5}, 'fig_width': {'type': 'int', 'default': 800}} | ['tabular_data', 'visualization', 'numerical_data'] | ['classification', 'regression']
validmind.data_validation.IQROutliersTable | IQR Outliers Table | Determines and summarizes outliers in numerical features using the Interquartile Range method.... | False | True | ['dataset'] | {'threshold': {'type': 'float', 'default': 1.5}} | ['tabular_data', 'numerical_data'] | ['classification', 'regression']
validmind.data_validation.IsolationForestOutliers | Isolation Forest Outliers | Detects outliers in a dataset using the Isolation Forest algorithm and visualizes results through scatter plots.... | True | False | ['dataset'] | {'random_state': {'type': 'int', 'default': 0}, 'contamination': {'type': 'float', 'default': 0.1}, 'feature_columns': {'type': 'list', 'default': None}} | ['tabular_data', 'anomaly_detection'] | ['classification']
validmind.data_validation.JarqueBera | Jarque Bera | Assesses normality of dataset features in an ML model using the Jarque-Bera test.... | False | True | ['dataset'] | {} | ['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels'] | ['classification', 'regression']
validmind.data_validation.KPSS | KPSS | Assesses the stationarity of time-series data in a machine learning model using the KPSS unit root test.... | False | True | ['dataset'] | {} | ['time_series_data', 'stationarity', 'unit_root_test', 'statsmodels'] | ['data_validation']
validmind.data_validation.LJungBox | L Jung Box | Assesses autocorrelations in dataset features by performing a Ljung-Box test on each feature.... | False | True | ['dataset'] | {} | ['time_series_data', 'forecasting', 'statistical_test', 'statsmodels'] | ['regression']
validmind.data_validation.LaggedCorrelationHeatmap | Lagged Correlation Heatmap | Assesses and visualizes correlation between target variable and lagged independent variables in a time-series... | True | False | ['dataset'] | {'num_lags': {'type': 'int', 'default': 10}} | ['time_series_data', 'visualization'] | ['regression']
validmind.data_validation.MissingValues | Missing Values | Evaluates dataset quality by ensuring missing value ratio across all features does not exceed a set threshold.... | False | True | ['dataset'] | {'min_threshold': {'type': 'int', 'default': 1}} | ['tabular_data', 'data_quality'] | ['classification', 'regression']
validmind.data_validation.MissingValuesBarPlot | Missing Values Bar Plot | Assesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on... | True | False | ['dataset'] | {'threshold': {'type': 'int', 'default': 80}, 'fig_height': {'type': 'int', 'default': 600}} | ['tabular_data', 'data_quality', 'visualization'] | ['classification', 'regression']
validmind.data_validation.MutualInformation | Mutual Information | Calculates mutual information scores between features and target variable to evaluate feature relevance.... | True | False | ['dataset'] | {'min_threshold': {'type': 'float', 'default': 0.01}, 'task': {'type': 'str', 'default': 'classification'}} | ['feature_selection', 'data_analysis'] | ['classification', 'regression']
validmind.data_validation.PearsonCorrelationMatrix | Pearson Correlation Matrix | Evaluates linear dependency between numerical variables in a dataset via a Pearson Correlation coefficient heat map.... | True | False | ['dataset'] | {} | ['tabular_data', 'numerical_data', 'correlation'] | ['classification', 'regression']
validmind.data_validation.PhillipsPerronArch | Phillips Perron Arch | Assesses the stationarity of time series data in each feature of the ML model using the Phillips-Perron test.... | False | True | ['dataset'] | {} | ['time_series_data', 'forecasting', 'statistical_test', 'unit_root_test'] | ['regression']
validmind.data_validation.ProtectedClassesDescription | Protected Classes Description | Visualizes the distribution of protected classes in the dataset relative to the target variable... | True | True | ['dataset'] | {'protected_classes': {'type': '_empty', 'default': None}} | ['bias_and_fairness', 'descriptive_statistics'] | ['classification', 'regression']
validmind.data_validation.RollingStatsPlot | Rolling Stats Plot | Evaluates the stationarity of time series data by plotting its rolling mean and standard deviation over a specified... | True | False | ['dataset'] | {'window_size': {'type': 'int', 'default': 12}} | ['time_series_data', 'visualization', 'stationarity'] | ['regression']
validmind.data_validation.RunsTest | Runs Test | Executes Runs Test on ML model to detect non-random patterns in output data sequence.... | False | True | ['dataset'] | {} | ['tabular_data', 'statistical_test', 'statsmodels'] | ['classification', 'regression']
validmind.data_validation.ScatterPlot | Scatter Plot | Assesses visual relationships, patterns, and outliers among features in a dataset through scatter plot matrices.... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression']
validmind.data_validation.ScoreBandDefaultRates | Score Band Default Rates | Analyzes default rates and population distribution across credit score bands.... | False | True | ['dataset', 'model'] | {'score_column': {'type': 'str', 'default': 'score'}, 'score_bands': {'type': 'list', 'default': None}} | ['visualization', 'credit_risk', 'scorecard'] | ['classification']
validmind.data_validation.SeasonalDecompose | Seasonal Decompose | Assesses patterns and seasonality in a time series dataset by decomposing its features into foundational components.... | True | False | ['dataset'] | {'seasonal_model': {'type': 'str', 'default': 'additive'}} | ['time_series_data', 'seasonality', 'statsmodels'] | ['regression']
validmind.data_validation.ShapiroWilk | Shapiro Wilk | Evaluates feature-wise normality of training data using the Shapiro-Wilk test.... | False | True | ['dataset'] | {} | ['tabular_data', 'data_distribution', 'statistical_test'] | ['classification', 'regression']
validmind.data_validation.Skewness | Skewness | Evaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data... | False | True | ['dataset'] | {'max_threshold': {'type': '_empty', 'default': 1}} | ['data_quality', 'tabular_data'] | ['classification', 'regression']
validmind.data_validation.SpreadPlot | Spread Plot | Assesses potential correlations between pairs of time series variables through visualization to enhance... | True | False | ['dataset'] | {} | ['time_series_data', 'visualization'] | ['regression']
validmind.data_validation.TabularCategoricalBarPlots | Tabular Categorical Bar Plots | Generates and visualizes bar plots for each category in categorical features to evaluate the dataset's composition.... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression']
validmind.data_validation.TabularDateTimeHistograms | Tabular Date Time Histograms | Generates histograms to provide graphical insight into the distribution of time intervals in a model's datetime... | True | False | ['dataset'] | {} | ['time_series_data', 'visualization'] | ['classification', 'regression']
validmind.data_validation.TabularDescriptionTables | Tabular Description Tables | Summarizes key descriptive statistics for numerical, categorical, and datetime variables in a dataset.... | False | True | ['dataset'] | {} | ['tabular_data'] | ['classification', 'regression']
validmind.data_validation.TabularNumericalHistograms | Tabular Numerical Histograms | Generates histograms for each numerical feature in a dataset to provide visual insights into data distribution and... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression']
validmind.data_validation.TargetRateBarPlots | Target Rate Bar Plots | Generates bar plots visualizing the default rates of categorical features for a classification machine learning... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization', 'categorical_data'] | ['classification']
validmind.data_validation.TimeSeriesDescription | Time Series Description | Generates a detailed analysis for the provided time series dataset, summarizing key statistics to identify trends,... | False | True | ['dataset'] | {} | ['time_series_data', 'analysis'] | ['regression']
validmind.data_validation.TimeSeriesDescriptiveStatistics | Time Series Descriptive Statistics | Evaluates the descriptive statistics of a time series dataset to identify trends, patterns, and data quality issues.... | False | True | ['dataset'] | {} | ['time_series_data', 'analysis'] | ['regression']
validmind.data_validation.TimeSeriesFrequency | Time Series Frequency | Evaluates consistency of time series data frequency and generates a frequency plot.... | True | True | ['dataset'] | {} | ['time_series_data'] | ['regression']
validmind.data_validation.TimeSeriesHistogram | Time Series Histogram | Visualizes distribution of time-series data using histograms and Kernel Density Estimation (KDE) lines.... | True | False | ['dataset'] | {'nbins': {'type': '_empty', 'default': 30}} | ['data_validation', 'visualization', 'time_series_data'] | ['regression', 'time_series_forecasting']
validmind.data_validation.TimeSeriesLinePlot | Time Series Line Plot | Generates and analyses time-series data through line plots revealing trends, patterns, anomalies over time.... | True | False | ['dataset'] | {} | ['time_series_data', 'visualization'] | ['regression']
validmind.data_validation.TimeSeriesMissingValues | Time Series Missing Values | Validates time-series data quality by confirming the count of missing values is below a certain threshold.... | True | True | ['dataset'] | {'min_threshold': {'type': 'int', 'default': 1}} | ['time_series_data'] | ['regression']
validmind.data_validation.TimeSeriesOutliers | Time Series Outliers | Identifies and visualizes outliers in time-series data using the z-score method.... | False | True | ['dataset'] | {'zscore_threshold': {'type': 'int', 'default': 3}} | ['time_series_data'] | ['regression']
validmind.data_validation.TooManyZeroValues | Too Many Zero Values | Identifies numerical columns in a dataset that contain an excessive number of zero values, defined by a threshold... | False | True | ['dataset'] | {'max_percent_threshold': {'type': 'float', 'default': 0.03}} | ['tabular_data'] | ['regression', 'classification']
validmind.data_validation.UniqueRows | Unique Rows | Verifies the diversity of the dataset by ensuring that the count of unique rows exceeds a prescribed threshold.... | False | True | ['dataset'] | {'min_percent_threshold': {'type': 'float', 'default': 1}} | ['tabular_data'] | ['regression', 'classification']
validmind.data_validation.WOEBinPlots | WOE Bin Plots | Generates visualizations of Weight of Evidence (WoE) and Information Value (IV) for understanding predictive power... | True | False | ['dataset'] | {'breaks_adj': {'type': 'list', 'default': None}, 'fig_height': {'type': 'int', 'default': 600}, 'fig_width': {'type': 'int', 'default': 500}} | ['tabular_data', 'visualization', 'categorical_data'] | ['classification']
validmind.data_validation.WOEBinTable | WOE Bin Table | Assesses the Weight of Evidence (WoE) and Information Value (IV) of each feature to evaluate its predictive power... | False | True | ['dataset'] | {'breaks_adj': {'type': 'list', 'default': None}} | ['tabular_data', 'categorical_data'] | ['classification']
validmind.data_validation.ZivotAndrewsArch | Zivot Andrews Arch | Evaluates the order of integration and stationarity of time series data using the Zivot-Andrews unit root test.... | False | True | ['dataset'] | {} | ['time_series_data', 'stationarity', 'unit_root_test'] | ['regression']
validmind.data_validation.nlp.CommonWords | Common Words | Assesses the most frequent non-stopwords in a text column for identifying prevalent language patterns.... | True | False | ['dataset'] | {} | ['nlp', 'text_data', 'visualization', 'frequency_analysis'] | ['text_classification', 'text_summarization']
validmind.data_validation.nlp.Hashtags | Hashtags | Assesses hashtag frequency in a text column, highlighting usage trends and potential dataset bias or spam.... | True | False | ['dataset'] | {'top_hashtags': {'type': 'int', 'default': 25}} | ['nlp', 'text_data', 'visualization', 'frequency_analysis'] | ['text_classification', 'text_summarization']
validmind.data_validation.nlp.LanguageDetection | Language Detection | Assesses the diversity of languages in a textual dataset by detecting and visualizing the distribution of languages.... | True | False | ['dataset'] | {} | ['nlp', 'text_data', 'visualization'] | ['text_classification', 'text_summarization']
validmind.data_validation.nlp.Mentions | Mentions | Calculates and visualizes frequencies of '@' prefixed mentions in a text-based dataset for NLP model analysis.... | True | False | ['dataset'] | {'top_mentions': {'type': 'int', 'default': 25}} | ['nlp', 'text_data', 'visualization', 'frequency_analysis'] | ['text_classification', 'text_summarization']
validmind.data_validation.nlp.PolarityAndSubjectivity | Polarity And Subjectivity | Analyzes the polarity and subjectivity of text data within a given dataset to visualize the sentiment distribution.... | True | True | ['dataset'] | {'threshold_subjectivity': {'type': '_empty', 'default': 0.5}, 'threshold_polarity': {'type': '_empty', 'default': 0}} | ['nlp', 'text_data', 'data_validation'] | ['nlp']
validmind.data_validation.nlp.Punctuations | Punctuations | Analyzes and visualizes the frequency distribution of punctuation usage in a given text dataset.... | True | False | ['dataset'] | {'count_mode': {'type': '_empty', 'default': 'token'}} | ['nlp', 'text_data', 'visualization', 'frequency_analysis'] | ['text_classification', 'text_summarization', 'nlp']
validmind.data_validation.nlp.Sentiment | Sentiment | Analyzes the sentiment of text data within a dataset using the VADER sentiment analysis tool.... | True | False | ['dataset'] | {} | ['nlp', 'text_data', 'data_validation'] | ['nlp']
validmind.data_validation.nlp.StopWords | Stop Words | Evaluates and visualizes the frequency of English stop words in a text dataset against a defined threshold.... | True | True | ['dataset'] | {'min_percent_threshold': {'type': 'float', 'default': 0.5}, 'num_words': {'type': 'int', 'default': 25}} | ['nlp', 'text_data', 'frequency_analysis', 'visualization'] | ['text_classification', 'text_summarization']
validmind.data_validation.nlp.TextDescription | Text Description | Conducts comprehensive textual analysis on a dataset using NLTK to evaluate various parameters and generate... | True | False | ['dataset'] | {'unwanted_tokens': {'type': 'set', 'default': {'s', 'mrs', 'us', \"''\", ' ', 'ms', 'dr', 'dollar', '``', 'mr', \"'s\", \"s'\"}}, 'lang': {'type': 'str', 'default': 'english'}} | ['nlp', 'text_data', 'visualization'] | ['text_classification', 'text_summarization']
validmind.data_validation.nlp.Toxicity | Toxicity | Assesses the toxicity of text data within a dataset to visualize the distribution of toxicity scores.... | True | False | ['dataset'] | {} | ['nlp', 'text_data', 'data_validation'] | ['nlp']
validmind.model_validation.BertScore | Bert Score | Assesses the quality of machine-generated text using BERTScore metrics and visualizes results through histograms... | True | True | ['dataset', 'model'] | {'evaluation_model': {'type': '_empty', 'default': 'distilbert-base-uncased'}} | ['nlp', 'text_data', 'visualization'] | ['text_classification', 'text_summarization']
validmind.model_validation.BleuScore | Bleu Score | Evaluates the quality of machine-generated text using BLEU metrics and visualizes the results through histograms... | True | True | ['dataset', 'model'] | {} | ['nlp', 'text_data', 'visualization'] | ['text_classification', 'text_summarization']
validmind.model_validation.ClusterSizeDistribution | Cluster Size Distribution | Assesses the performance of clustering models by comparing the distribution of cluster sizes in model predictions... | True | False | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering']
validmind.model_validation.ContextualRecall | Contextual Recall | Evaluates a Natural Language Generation model's ability to generate contextually relevant and factually correct... | True | True | ['dataset', 'model'] | {} | ['nlp', 'text_data', 'visualization'] | ['text_classification', 'text_summarization']
validmind.model_validation.FeaturesAUC | Features AUC | Evaluates the discriminatory power of each individual feature within a binary classification model by calculating... | True | False | ['dataset'] | {'fontsize': {'type': 'int', 'default': 12}, 'figure_height': {'type': 'int', 'default': 500}} | ['feature_importance', 'AUC', 'visualization'] | ['classification']
validmind.model_validation.MeteorScore | Meteor Score | Assesses the quality of machine-generated translations by comparing them to human-produced references using the... | True | True | ['dataset', 'model'] | {} | ['nlp', 'text_data', 'visualization'] | ['text_classification', 'text_summarization']
validmind.model_validation.ModelMetadata | Model Metadata | Compare metadata of different models and generate a summary table with the results.... | False | True | ['model'] | {} | ['model_training', 'metadata'] | ['regression', 'time_series_forecasting']
validmind.model_validation.ModelPredictionResiduals | Model Prediction Residuals | Assesses normality and behavior of residuals in regression models through visualization and statistical tests.... | True | True | ['dataset', 'model'] | {'nbins': {'type': 'int', 'default': 100}, 'p_value_threshold': {'type': 'float', 'default': 0.05}, 'start_date': {'type': None, 'default': None}, 'end_date': {'type': None, 'default': None}} | ['regression'] | ['residual_analysis', 'visualization']
validmind.model_validation.RegardScore | Regard Score | Assesses the sentiment and potential biases in text generated by NLP models by computing and visualizing regard... | True | True | ['dataset', 'model'] | {} | ['nlp', 'text_data', 'visualization'] | ['text_classification', 'text_summarization']
validmind.model_validation.RegressionResidualsPlot | Regression Residuals Plot | Evaluates regression model performance using residual distribution and actual vs. predicted plots.... | True | False | ['model', 'dataset'] | {'bin_size': {'type': 'float', 'default': 0.1}} | ['model_performance', 'visualization'] | ['regression']
validmind.model_validation.RougeScore | Rouge Score | Assesses the quality of machine-generated text using ROUGE metrics and visualizes the results to provide... | True | True | ['dataset', 'model'] | {'metric': {'type': 'str', 'default': 'rouge-1'}} | ['nlp', 'text_data', 'visualization'] | ['text_classification', 'text_summarization']
validmind.model_validation.TimeSeriesPredictionWithCI | Time Series Prediction With CI | Assesses predictive accuracy and uncertainty in time series models, highlighting breaches beyond confidence... | True | True | ['dataset', 'model'] | {'confidence': {'type': 'float', 'default': 0.95}} | ['model_predictions', 'visualization'] | ['regression', 'time_series_forecasting']
validmind.model_validation.TimeSeriesPredictionsPlot | Time Series Predictions Plot | Plot actual vs predicted values for time series data and generate a visual comparison for the model.... | True | False | ['dataset', 'model'] | {} | ['model_predictions', 'visualization'] | ['regression', 'time_series_forecasting']
validmind.model_validation.TimeSeriesR2SquareBySegmentsTime Series R2 Square By SegmentsEvaluates the R-Squared values of regression models over specified time segments in time series data to assess...TrueTrue['dataset', 'model']{'segments': {'type': None, 'default': None}}['model_performance', 'sklearn']['regression', 'time_series_forecasting']
validmind.model_validation.TokenDisparityToken DisparityEvaluates the token disparity between reference and generated texts, visualizing the results through histograms and...TrueTrue['dataset', 'model']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.ToxicityScoreToxicity ScoreAssesses the toxicity levels of texts generated by NLP models to identify and mitigate harmful or offensive content....TrueTrue['dataset', 'model']{}['nlp', 'text_data', 'visualization']['text_classification', 'text_summarization']
validmind.model_validation.embeddings.ClusterDistributionCluster DistributionAssesses the distribution of text embeddings across clusters produced by a model using KMeans clustering....TrueFalse['model', 'dataset']{'num_clusters': {'type': 'int', 'default': 5}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.CosineSimilarityComparisonCosine Similarity ComparisonAssesses the similarity between embeddings generated by different models using Cosine Similarity, providing both...TrueTrue['dataset', 'models']{}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.CosineSimilarityDistributionCosine Similarity DistributionAssesses the similarity between predicted text embeddings from a model using a Cosine Similarity distribution...TrueFalse['dataset', 'model']{}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.CosineSimilarityHeatmapCosine Similarity HeatmapGenerates an interactive heatmap to visualize the cosine similarities among embeddings derived from a given model....TrueFalse['dataset', 'model']{'title': {'type': '_empty', 'default': 'Cosine Similarity Matrix'}, 'color': {'type': '_empty', 'default': 'Cosine Similarity'}, 'xaxis_title': {'type': '_empty', 'default': 'Index'}, 'yaxis_title': {'type': '_empty', 'default': 'Index'}, 'color_scale': {'type': '_empty', 'default': 'Blues'}}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.DescriptiveAnalyticsDescriptive AnalyticsEvaluates statistical properties of text embeddings in an ML model via mean, median, and standard deviation...TrueFalse['dataset', 'model']{}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.EmbeddingsVisualization2DEmbeddings Visualization2 DVisualizes 2D representation of text embeddings generated by a model using t-SNE technique....TrueFalse['dataset', 'model']{'cluster_column': {'type': None, 'default': None}, 'perplexity': {'type': 'int', 'default': 30}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.EuclideanDistanceComparisonEuclidean Distance ComparisonAssesses and visualizes the dissimilarity between model embeddings using Euclidean distance, providing insights...TrueTrue['dataset', 'models']{}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.EuclideanDistanceHeatmapEuclidean Distance HeatmapGenerates an interactive heatmap to visualize the Euclidean distances among embeddings derived from a given model....TrueFalse['dataset', 'model']{'title': {'type': '_empty', 'default': 'Euclidean Distance Matrix'}, 'color': {'type': '_empty', 'default': 'Euclidean Distance'}, 'xaxis_title': {'type': '_empty', 'default': 'Index'}, 'yaxis_title': {'type': '_empty', 'default': 'Index'}, 'color_scale': {'type': '_empty', 'default': 'Blues'}}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.PCAComponentsPairwisePlotsPCA Components Pairwise PlotsGenerates scatter plots for pairwise combinations of principal component analysis (PCA) components of model...TrueFalse['dataset', 'model']{'n_components': {'type': 'int', 'default': 3}}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.embeddings.StabilityAnalysisKeywordStability Analysis KeywordEvaluates robustness of embedding models to keyword swaps in the test dataset....TrueTrue['dataset', 'model']{'keyword_dict': {'type': None, 'default': None}, 'mean_similarity_threshold': {'type': 'float', 'default': 0.7}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.StabilityAnalysisRandomNoiseStability Analysis Random NoiseAssesses the robustness of text embeddings models to random noise introduced via text perturbations....TrueTrue['dataset', 'model']{'probability': {'type': 'float', 'default': 0.02}, 'mean_similarity_threshold': {'type': 'float', 'default': 0.7}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.StabilityAnalysisSynonymsStability Analysis SynonymsEvaluates the stability of text embeddings models when words in test data are replaced by their synonyms randomly....TrueTrue['dataset', 'model']{'probability': {'type': 'float', 'default': 0.02}, 'mean_similarity_threshold': {'type': 'float', 'default': 0.7}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.StabilityAnalysisTranslationStability Analysis TranslationEvaluates robustness of text embeddings models to noise introduced by translating the original text to another...TrueTrue['dataset', 'model']{'source_lang': {'type': 'str', 'default': 'en'}, 'target_lang': {'type': 'str', 'default': 'fr'}, 'mean_similarity_threshold': {'type': 'float', 'default': 0.7}}['llm', 'text_data', 'embeddings', 'visualization']['feature_extraction']
validmind.model_validation.embeddings.TSNEComponentsPairwisePlotsTSNE Components Pairwise PlotsCreates scatter plots for pairwise combinations of t-SNE components to visualize embeddings and highlight potential...TrueFalse['dataset', 'model']{'n_components': {'type': 'int', 'default': 2}, 'perplexity': {'type': 'int', 'default': 30}, 'title': {'type': 'str', 'default': 't-SNE'}}['visualization', 'dimensionality_reduction', 'embeddings']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.AnswerCorrectnessAnswer CorrectnessEvaluates the correctness of answers in a dataset with respect to the provided ground...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'response_column': {'type': 'str', 'default': 'response'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.AspectCriticAspect CriticEvaluates generations against the following aspects: harmfulness, maliciousness,...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'response_column': {'type': 'str', 'default': 'response'}, 'retrieved_contexts_column': {'type': None, 'default': None}, 'aspects': {'type': None, 'default': ['coherence', 'conciseness', 'correctness', 'harmfulness', 'maliciousness']}, 'additional_aspects': {'type': None, 'default': None}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'qualitative']['text_summarization', 'text_generation', 'text_qa']
validmind.model_validation.ragas.ContextEntityRecallContext Entity RecallEvaluates the context entity recall for dataset entries and visualizes the results....TrueTrue['dataset']{'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'retrieval_performance']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.ContextPrecisionContext PrecisionContext Precision is a metric that evaluates whether all of the ground-truth...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'retrieval_performance']['text_qa', 'text_generation', 'text_summarization', 'text_classification']
validmind.model_validation.ragas.ContextPrecisionWithoutReferenceContext Precision Without ReferenceContext Precision Without Reference is a metric used to evaluate the relevance of...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'response_column': {'type': 'str', 'default': 'response'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'retrieval_performance']['text_qa', 'text_generation', 'text_summarization', 'text_classification']
validmind.model_validation.ragas.ContextRecallContext RecallContext recall measures the extent to which the retrieved context aligns with the...TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'retrieval_performance']['text_qa', 'text_generation', 'text_summarization', 'text_classification']
validmind.model_validation.ragas.FaithfulnessFaithfulnessEvaluates the faithfulness of the generated answers with respect to retrieved contexts....TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'response_column': {'type': 'str', 'default': 'response'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'rag_performance']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.NoiseSensitivityNoise SensitivityAssesses the sensitivity of a Large Language Model (LLM) to noise in retrieved context by measuring how often it...TrueTrue['dataset']{'response_column': {'type': 'str', 'default': 'response'}, 'retrieved_contexts_column': {'type': 'str', 'default': 'retrieved_contexts'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'focus': {'type': 'str', 'default': 'relevant'}, 'user_input_column': {'type': 'str', 'default': 'user_input'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'rag_performance']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.ResponseRelevancyResponse RelevancyAssesses how pertinent the generated answer is to the given prompt....TrueTrue['dataset']{'user_input_column': {'type': 'str', 'default': 'user_input'}, 'retrieved_contexts_column': {'type': 'str', 'default': None}, 'response_column': {'type': 'str', 'default': 'response'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm', 'rag_performance']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.ragas.SemanticSimilaritySemantic SimilarityCalculates the semantic similarity between generated responses and ground truths...TrueTrue['dataset']{'response_column': {'type': 'str', 'default': 'response'}, 'reference_column': {'type': 'str', 'default': 'reference'}, 'judge_llm': {'type': '_empty', 'default': None}, 'judge_embeddings': {'type': '_empty', 'default': None}}['ragas', 'llm']['text_qa', 'text_generation', 'text_summarization']
validmind.model_validation.sklearn.AdjustedMutualInformationAdjusted Mutual InformationEvaluates clustering model performance by measuring mutual information between true and predicted labels, adjusting...FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.AdjustedRandIndexAdjusted Rand IndexMeasures the similarity between two data clusters using the Adjusted Rand Index (ARI) metric in clustering machine...FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.CalibrationCurveCalibration CurveEvaluates the calibration of probability estimates by comparing predicted probabilities against observed...TrueFalse['model', 'dataset']{'n_bins': {'type': 'int', 'default': 10}}['sklearn', 'model_performance', 'classification']['classification']
validmind.model_validation.sklearn.ClassifierPerformanceClassifier PerformanceEvaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,...FalseTrue['dataset', 'model']{'average': {'type': 'str', 'default': 'macro'}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.ClassifierThresholdOptimizationClassifier Threshold OptimizationAnalyzes and visualizes different threshold optimization methods for binary classification models....FalseTrue['dataset', 'model']{'methods': {'type': None, 'default': None}, 'target_recall': {'type': None, 'default': None}}['model_validation', 'threshold_optimization', 'classification_metrics']['classification']
validmind.model_validation.sklearn.ClusterCosineSimilarityCluster Cosine SimilarityMeasures the intra-cluster similarity of a clustering model using cosine similarity....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.ClusterPerformanceMetricsCluster Performance MetricsEvaluates the performance of clustering machine learning models using multiple established metrics....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.CompletenessScoreCompleteness ScoreEvaluates a clustering model's capacity to categorize instances from a single class into the same cluster....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance', 'clustering']['clustering']
validmind.model_validation.sklearn.ConfusionMatrixConfusion MatrixEvaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...TrueFalse['dataset', 'model']{'threshold': {'type': 'float', 'default': 0.5}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.FeatureImportanceFeature ImportanceCompute feature importance scores for a given model and generate a summary table...FalseTrue['dataset', 'model']{'num_features': {'type': 'int', 'default': 3}}['model_explainability', 'sklearn']['regression', 'time_series_forecasting']
validmind.model_validation.sklearn.FowlkesMallowsScoreFowlkes Mallows ScoreEvaluates the similarity between predicted and actual cluster assignments in a model using the Fowlkes-Mallows...FalseTrue['dataset', 'model']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.sklearn.HomogeneityScoreHomogeneity ScoreAssesses clustering homogeneity by comparing true and predicted labels, scoring from 0 (heterogeneous) to 1...FalseTrue['dataset', 'model']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.sklearn.HyperParametersTuningHyper Parameters TuningPerforms exhaustive grid search over specified parameter ranges to find optimal model configurations...FalseTrue['model', 'dataset']{'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}, 'fit_params': {'type': 'dict', 'default': None}}['sklearn', 'model_performance']['clustering', 'classification']
validmind.model_validation.sklearn.KMeansClustersOptimizationK Means Clusters OptimizationOptimizes the number of clusters in K-means models using Elbow and Silhouette methods....TrueFalse['model', 'dataset']{'n_clusters': {'type': None, 'default': None}}['sklearn', 'model_performance', 'kmeans']['clustering']
validmind.model_validation.sklearn.MinimumAccuracyMinimum AccuracyChecks if the model's prediction accuracy meets or surpasses a specified threshold....FalseTrue['dataset', 'model']{'min_threshold': {'type': 'float', 'default': 0.7}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumF1ScoreMinimum F1 ScoreAssesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced...FalseTrue['dataset', 'model']{'min_threshold': {'type': 'float', 'default': 0.5}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.MinimumROCAUCScoreMinimum ROCAUC ScoreValidates model by checking if the ROC AUC score meets or surpasses a specified threshold....FalseTrue['dataset', 'model']{'min_threshold': {'type': 'float', 'default': 0.5}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.ModelParametersModel ParametersExtracts and displays model parameters in a structured format for transparency and reproducibility....FalseTrue['model']{'model_params': {'type': None, 'default': None}}['model_training', 'metadata']['classification', 'regression']
validmind.model_validation.sklearn.ModelsPerformanceComparisonModels Performance ComparisonEvaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,...FalseTrue['dataset', 'models']{}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison']['classification', 'text_classification']
validmind.model_validation.sklearn.OverfitDiagnosisOverfit DiagnosisAssesses potential overfitting in a model's predictions, identifying regions where performance between training and...TrueTrue['model', 'datasets']{'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}}['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis']['classification', 'regression']
validmind.model_validation.sklearn.PermutationFeatureImportancePermutation Feature ImportanceAssesses the significance of each feature in a model by evaluating the impact on model performance when feature...TrueFalse['model', 'dataset']{'fontsize': {'type': None, 'default': None}, 'figure_height': {'type': None, 'default': None}}['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.PopulationStabilityIndexPopulation Stability IndexAssesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across...TrueTrue['datasets', 'model']{'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurvePrecision Recall CurveEvaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve....TrueFalse['model', 'dataset']{}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurveROC CurveEvaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic...TrueFalse['model', 'dataset']{}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.RegressionErrorsRegression ErrorsAssesses the performance and error distribution of a regression model using various error metrics....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance']['regression', 'classification']
validmind.model_validation.sklearn.RegressionErrorsComparisonRegression Errors ComparisonAssesses multiple regression error metrics to compare model performance across different datasets, emphasizing...FalseTrue['datasets', 'models']{}['model_performance', 'sklearn']['regression', 'time_series_forecasting']
validmind.model_validation.sklearn.RegressionPerformanceRegression PerformanceEvaluates the performance of a regression model using five different metrics: MAE, MSE, RMSE, MAPE, and MBD....FalseTrue['model', 'dataset']{}['sklearn', 'model_performance']['regression']
validmind.model_validation.sklearn.RegressionR2SquareRegression R2 SquareAssesses the overall goodness-of-fit of a regression model by evaluating R-squared (R2) and Adjusted R-squared (Adj...FalseTrue['dataset', 'model']{}['sklearn', 'model_performance']['regression']
validmind.model_validation.sklearn.RegressionR2SquareComparisonRegression R2 Square ComparisonCompares R-Squared and Adjusted R-Squared values for different regression models across multiple datasets to assess...FalseTrue['datasets', 'models']{}['model_performance', 'sklearn']['regression', 'time_series_forecasting']
validmind.model_validation.sklearn.RobustnessDiagnosisRobustness DiagnosisAssesses the robustness of a machine learning model by evaluating performance decay under noisy conditions....TrueTrue['datasets', 'model']{'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': None, 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}}['sklearn', 'model_diagnosis', 'visualization']['classification', 'regression']
validmind.model_validation.sklearn.SHAPGlobalImportanceSHAP Global ImportanceEvaluates and visualizes global feature importance using SHAP values for model explanation and risk identification....FalseTrue['model', 'dataset']{'kernel_explainer_samples': {'type': 'int', 'default': 10}, 'tree_or_linear_explainer_samples': {'type': 'int', 'default': 200}, 'class_of_interest': {'type': None, 'default': None}}['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.ScoreProbabilityAlignmentScore Probability AlignmentAnalyzes the alignment between credit scores and predicted probabilities....TrueTrue['model', 'dataset']{'score_column': {'type': 'str', 'default': 'score'}, 'n_bins': {'type': 'int', 'default': 10}}['visualization', 'credit_risk', 'calibration']['classification']
validmind.model_validation.sklearn.SilhouettePlotSilhouette PlotCalculates and visualizes Silhouette Score, assessing the degree of data point suitability to its cluster in ML...TrueTrue['model', 'dataset']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.sklearn.TrainingTestDegradationTraining Test DegradationTests if model performance degradation between training and test datasets exceeds a predefined threshold....FalseTrue['datasets', 'model']{'max_threshold': {'type': 'float', 'default': 0.1}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.VMeasureV MeasureEvaluates homogeneity and completeness of a clustering model using the V Measure Score....FalseTrue['dataset', 'model']{}['sklearn', 'model_performance']['clustering']
validmind.model_validation.sklearn.WeakspotsDiagnosisWeakspots DiagnosisIdentifies and visualizes weak spots in a machine learning model's performance across various sections of the...TrueTrue['datasets', 'model']{'features_columns': {'type': None, 'default': None}, 'metrics': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization']['classification', 'text_classification']
validmind.model_validation.statsmodels.AutoARIMAAuto ARIMAEvaluates ARIMA models for time-series forecasting, ranking them using Bayesian and Akaike Information Criteria....FalseTrue['model', 'dataset']{}['time_series_data', 'forecasting', 'model_selection', 'statsmodels']['regression']
validmind.model_validation.statsmodels.CumulativePredictionProbabilitiesCumulative Prediction ProbabilitiesVisualizes cumulative probabilities of positive and negative classes for both training and testing in classification models....TrueFalse['dataset', 'model']{'title': {'type': 'str', 'default': 'Cumulative Probabilities'}}['visualization', 'credit_risk']['classification']
validmind.model_validation.statsmodels.DurbinWatsonTestDurbin Watson TestAssesses autocorrelation in time series data features using the Durbin-Watson statistic....FalseTrue['dataset', 'model']{'threshold': {'type': None, 'default': [1.5, 2.5]}}['time_series_data', 'forecasting', 'statistical_test', 'statsmodels']['regression']
validmind.model_validation.statsmodels.GINITableGINI TableEvaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets....FalseTrue['dataset', 'model']{}['model_performance']['classification']
validmind.model_validation.statsmodels.KolmogorovSmirnovKolmogorov SmirnovAssesses whether each feature in the dataset aligns with a normal distribution using the Kolmogorov-Smirnov test....FalseTrue['model', 'dataset']{'dist': {'type': 'str', 'default': 'norm'}}['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels']['classification', 'regression']
validmind.model_validation.statsmodels.LillieforsLillieforsAssesses the normality of feature distributions in an ML model's training dataset using the Lilliefors test....FalseTrue['dataset']{}['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels']['classification', 'regression']
validmind.model_validation.statsmodels.PredictionProbabilitiesHistogramPrediction Probabilities HistogramAssesses the predictive probability distribution for binary classification to evaluate model performance and...TrueFalse['dataset', 'model']{'title': {'type': 'str', 'default': 'Histogram of Predictive Probabilities'}}['visualization', 'credit_risk']['classification']
validmind.model_validation.statsmodels.RegressionCoeffsRegression CoeffsAssesses the significance and uncertainty of predictor variables in a regression model through visualization of...TrueTrue['model']{}['tabular_data', 'visualization', 'model_training']['regression']
validmind.model_validation.statsmodels.RegressionFeatureSignificanceRegression Feature SignificanceAssesses and visualizes the statistical significance of features in a regression model....TrueFalse['model']{'fontsize': {'type': 'int', 'default': 10}, 'p_threshold': {'type': 'float', 'default': 0.05}}['statistical_test', 'model_interpretation', 'visualization', 'feature_importance']['regression']
validmind.model_validation.statsmodels.RegressionModelForecastPlotRegression Model Forecast PlotGenerates plots to visually compare the forecasted outcomes of a regression model against actual observed values over...TrueFalse['model', 'dataset']{'start_date': {'type': None, 'default': None}, 'end_date': {'type': None, 'default': None}}['time_series_data', 'forecasting', 'visualization']['regression']
validmind.model_validation.statsmodels.RegressionModelForecastPlotLevelsRegression Model Forecast Plot LevelsAssesses the alignment between forecasted and observed values in regression models through visual plots...TrueFalse['model', 'dataset']{}['time_series_data', 'forecasting', 'visualization']['regression']
validmind.model_validation.statsmodels.RegressionModelSensitivityPlotRegression Model Sensitivity PlotAssesses the sensitivity of a regression model to changes in independent variables by applying shocks and...TrueFalse['dataset', 'model']{'shocks': {'type': None, 'default': [0.1]}, 'transformation': {'type': None, 'default': None}}['senstivity_analysis', 'visualization']['regression']
validmind.model_validation.statsmodels.RegressionModelSummaryRegression Model SummaryEvaluates regression model performance using metrics including R-Squared, Adjusted R-Squared, MSE, and RMSE....FalseTrue['dataset', 'model']{}['model_performance', 'regression']['regression']
validmind.model_validation.statsmodels.RegressionPermutationFeatureImportanceRegression Permutation Feature ImportanceAssesses the significance of each feature in a model by evaluating the impact on model performance when feature...TrueFalse['dataset', 'model']{'fontsize': {'type': 'int', 'default': 12}, 'figure_height': {'type': 'int', 'default': 500}}['statsmodels', 'feature_importance', 'visualization']['regression']
validmind.model_validation.statsmodels.ScorecardHistogramScorecard HistogramThe Scorecard Histogram test evaluates the distribution of credit scores between default and non-default instances,...TrueFalse['dataset']{'title': {'type': 'str', 'default': 'Histogram of Scores'}, 'score_column': {'type': 'str', 'default': 'score'}}['visualization', 'credit_risk', 'logistic_regression']['classification']
validmind.ongoing_monitoring.CalibrationCurveDriftCalibration Curve DriftEvaluates changes in probability calibration between reference and monitoring datasets....TrueTrue['datasets', 'model']{'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.ongoing_monitoring.ClassDiscriminationDriftClass Discrimination DriftCompares classification discrimination metrics between reference and monitoring datasets....FalseTrue['datasets', 'model']{'drift_pct_threshold': {'type': '_empty', 'default': 20}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.ongoing_monitoring.ClassImbalanceDriftClass Imbalance DriftEvaluates drift in class distribution between reference and monitoring datasets....TrueTrue['datasets']{'drift_pct_threshold': {'type': 'float', 'default': 5.0}, 'title': {'type': 'str', 'default': 'Class Distribution Drift'}}['tabular_data', 'binary_classification', 'multiclass_classification']['classification']
validmind.ongoing_monitoring.ClassificationAccuracyDriftClassification Accuracy DriftCompares classification accuracy metrics between reference and monitoring datasets....FalseTrue['datasets', 'model']{'drift_pct_threshold': {'type': '_empty', 'default': 20}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.ongoing_monitoring.ConfusionMatrixDriftConfusion Matrix DriftCompares confusion matrix metrics between reference and monitoring datasets....FalseTrue['datasets', 'model']{'drift_pct_threshold': {'type': '_empty', 'default': 20}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance']['classification', 'text_classification']
validmind.ongoing_monitoring.CumulativePredictionProbabilitiesDriftCumulative Prediction Probabilities DriftCompares cumulative prediction probability distributions between reference and monitoring datasets....TrueFalse['datasets', 'model']{}['visualization', 'credit_risk']['classification']
validmind.ongoing_monitoring.FeatureDriftFeature DriftEvaluates changes in feature distribution over time to identify potential model drift....TrueTrue['datasets']{'bins': {'type': '_empty', 'default': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}, 'feature_columns': {'type': '_empty', 'default': None}, 'psi_threshold': {'type': '_empty', 'default': 0.2}}['visualization']['monitoring']
validmind.ongoing_monitoring.PredictionAcrossEachFeaturePrediction Across Each FeatureAssesses differences in model predictions across individual features between reference and monitoring datasets...TrueFalse['datasets', 'model']{}['visualization']['monitoring']
validmind.ongoing_monitoring.PredictionCorrelationPrediction CorrelationAssesses correlation changes between model predictions from reference and monitoring datasets to detect potential...TrueTrue['datasets', 'model']{'drift_pct_threshold': {'type': 'float', 'default': 20}}['visualization']['monitoring']
validmind.ongoing_monitoring.PredictionProbabilitiesHistogramDriftPrediction Probabilities Histogram DriftCompares prediction probability distributions between reference and monitoring datasets....TrueTrue['datasets', 'model']{'title': {'type': '_empty', 'default': 'Prediction Probabilities Histogram Drift'}, 'drift_pct_threshold': {'type': 'float', 'default': 20.0}}['visualization', 'credit_risk']['classification']
validmind.ongoing_monitoring.PredictionQuantilesAcrossFeaturesPrediction Quantiles Across FeaturesAssesses differences in model prediction distributions across individual features between reference...TrueFalse['datasets', 'model']{}['visualization']['monitoring']
validmind.ongoing_monitoring.ROCCurveDriftROC Curve DriftCompares ROC curves between reference and monitoring datasets....TrueFalse['datasets', 'model']{}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.ongoing_monitoring.ScoreBandsDriftScore Bands DriftAnalyzes drift in population distribution and default rates across score bands....FalseTrue['datasets', 'model']{'score_column': {'type': 'str', 'default': 'score'}, 'score_bands': {'type': 'list', 'default': None}, 'drift_threshold': {'type': 'float', 'default': 20.0}}['visualization', 'credit_risk', 'scorecard']['classification']
validmind.ongoing_monitoring.ScorecardHistogramDriftScorecard Histogram DriftCompares score distributions between reference and monitoring datasets for each class....TrueTrue['datasets']{'score_column': {'type': 'str', 'default': 'score'}, 'title': {'type': 'str', 'default': 'Scorecard Histogram Drift'}, 'drift_pct_threshold': {'type': 'float', 'default': 20.0}}['visualization', 'credit_risk', 'logistic_regression']['classification']
validmind.ongoing_monitoring.TargetPredictionDistributionPlotTarget Prediction Distribution PlotAssesses differences in prediction distributions between a reference dataset and a monitoring dataset to identify...TrueTrue['datasets', 'model']{'drift_pct_threshold': {'type': 'float', 'default': 20}}['visualization']['monitoring']
validmind.prompt_validation.BiasBiasAssesses potential bias in a Large Language Model by analyzing the distribution and order of exemplars in the...FalseTrue['model']{'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}}['llm', 'few_shot']['text_classification', 'text_summarization']
validmind.prompt_validation.ClarityClarityEvaluates and scores the clarity of prompts in a Large Language Model based on specified guidelines....FalseTrue['model']{'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}}['llm', 'zero_shot', 'few_shot']['text_classification', 'text_summarization']
validmind.prompt_validation.ConcisenessConcisenessAnalyzes and grades the conciseness of prompts provided to a Large Language Model....FalseTrue['model']{'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}}['llm', 'zero_shot', 'few_shot']['text_classification', 'text_summarization']
validmind.prompt_validation.DelimitationDelimitationEvaluates the proper use of delimiters in prompts provided to Large Language Models....FalseTrue['model']{'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}}['llm', 'zero_shot', 'few_shot']['text_classification', 'text_summarization']
validmind.prompt_validation.NegativeInstructionNegative InstructionEvaluates and grades the use of affirmative, proactive language over negative instructions in LLM prompts....FalseTrue['model']{'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}}['llm', 'zero_shot', 'few_shot']['text_classification', 'text_summarization']
validmind.prompt_validation.RobustnessRobustnessAssesses the robustness of prompts provided to a Large Language Model under varying conditions and contexts. This test...FalseTrue['model', 'dataset']{'num_tests': {'type': '_empty', 'default': 10}, 'judge_llm': {'type': '_empty', 'default': None}}['llm', 'zero_shot', 'few_shot']['text_classification', 'text_summarization']
validmind.prompt_validation.SpecificitySpecificityEvaluates and scores the specificity of prompts provided to a Large Language Model (LLM), based on clarity, detail,...FalseTrue['model']{'min_threshold': {'type': '_empty', 'default': 7}, 'judge_llm': {'type': '_empty', 'default': None}}['llm', 'zero_shot', 'few_shot']['text_classification', 'text_summarization']
validmind.unit_metrics.classification.AccuracyAccuracyCalculates the accuracy of a modelFalseFalse['dataset', 'model']{}['classification']['classification']
validmind.unit_metrics.classification.F1F1Calculates the F1 score for a classification model.FalseFalse['model', 'dataset']{}['classification']['classification']
validmind.unit_metrics.classification.PrecisionPrecisionCalculates the precision for a classification model.FalseFalse['model', 'dataset']{}['classification']['classification']
validmind.unit_metrics.classification.ROC_AUCROC AUCCalculates the ROC AUC for a classification model.FalseFalse['model', 'dataset']{}['classification']['classification']
validmind.unit_metrics.classification.RecallRecallCalculates the recall for a classification model.FalseFalse['model', 'dataset']{}['classification']['classification']
validmind.unit_metrics.regression.AdjustedRSquaredScoreAdjusted R Squared ScoreCalculates the adjusted R-squared score for a regression model.FalseFalse['model', 'dataset']{}['regression']['regression']
validmind.unit_metrics.regression.GiniCoefficientGini CoefficientCalculates the Gini coefficient for a regression model.FalseFalse['dataset', 'model']{}['regression']['regression']
validmind.unit_metrics.regression.HuberLossHuber LossCalculates the Huber loss for a regression model.FalseFalse['model', 'dataset']{}['regression']['regression']
validmind.unit_metrics.regression.KolmogorovSmirnovStatisticKolmogorov Smirnov StatisticCalculates the Kolmogorov-Smirnov statistic for a regression model.FalseFalse['dataset', 'model']{}['regression']['regression']
validmind.unit_metrics.regression.MeanAbsoluteErrorMean Absolute ErrorCalculates the mean absolute error for a regression model.FalseFalse['model', 'dataset']{}['regression']['regression']
validmind.unit_metrics.regression.MeanAbsolutePercentageErrorMean Absolute Percentage ErrorCalculates the mean absolute percentage error for a regression model.FalseFalse['model', 'dataset']{}['regression']['regression']
validmind.unit_metrics.regression.MeanBiasDeviationMean Bias DeviationCalculates the mean bias deviation for a regression model.FalseFalse['model', 'dataset']{}['regression']['regression']
validmind.unit_metrics.regression.MeanSquaredErrorMean Squared ErrorCalculates the mean squared error for a regression model.FalseFalse['model', 'dataset']{}['regression']['regression']
validmind.unit_metrics.regression.QuantileLossQuantile LossCalculates the quantile loss for a regression model.FalseFalse['model', 'dataset']{'quantile': {'type': '_empty', 'default': 0.5}}['regression']['regression']
validmind.unit_metrics.regression.RSquaredScoreR Squared ScoreCalculates the R-squared score for a regression model.FalseFalse['model', 'dataset']{}['regression']['regression']
validmind.unit_metrics.regression.RootMeanSquaredErrorRoot Mean Squared ErrorCalculates the root mean squared error for a regression model.FalseFalse['model', 'dataset']{}['regression']['regression']
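Every ID in the listing above can be passed straight to the test runner, with `inputs` matching the Required Inputs column and `params` matching the Params column. A minimal sketch, assuming `vm_test_ds` and `vm_model` are hypothetical names for a dataset and model already wrapped via `vm.init_dataset()` and `vm.init_model()`:

```python
from validmind.tests import run_test

# Run a listed test by its ID; `inputs` corresponds to the Required Inputs
# column and `params` to the Params column of the table above.
result = run_test(
    "validmind.model_validation.sklearn.ConfusionMatrix",
    inputs={"dataset": vm_test_ds, "model": vm_model},  # hypothetical objects
    params={"threshold": 0.5},
)
result.log()  # log the result to the ValidMind Platform
```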
\n" ], - "source": [ - "list_tags()" + "text/plain": [ + "" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, to match each task type with its related tags, use the [list_tasks_and_tags()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tasks_and_tags) function:" - ] - }, + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list_tests()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Understand tags and task types\n", + "\n", + "Use [list_tasks()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tasks) to view all unique task types used to classify tests in the ValidMind Library.\n", + "\n", + "Understanding `task` types helps you filter tests that match your model’s objective. For example:\n", + "\n", + "- **classification:** Works with Classification Models and Datasets.\n", + "- **regression:** Works with Regression Models and Datasets.\n", + "- **text classification:** Works with Text Classification Models and Datasets.\n", + "- **text summarization:** Works with Text Summarization Models and Datasets." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Task | Tags
regression | senstivity_analysis, tabular_data, time_series_data, model_predictions, feature_selection, correlation, regression, statsmodels, model_performance, model_training, multiclass_classification, linear_regression, data_quality, text_data, model_explainability, binary_classification, stationarity, bias_and_fairness, numerical_data, sklearn, model_selection, statistical_test, descriptive_statistics, seasonality, analysis, data_validation, data_distribution, metadata, feature_importance, visualization, forecasting, model_diagnosis, model_interpretation, unit_root_test, categorical_data, data_analysis
classification | calibration, anomaly_detection, classification_metrics, tabular_data, time_series_data, feature_selection, correlation, statsmodels, model_performance, model_validation, model_training, classification, multiclass_classification, linear_regression, data_quality, text_data, binary_classification, threshold_optimization, bias_and_fairness, scorecard, model_comparison, numerical_data, sklearn, statistical_test, descriptive_statistics, feature_importance, data_distribution, metadata, visualization, credit_risk, AUC, logistic_regression, model_diagnosis, categorical_data, data_analysis
text_classification | model_performance, feature_importance, multiclass_classification, few_shot, frequency_analysis, zero_shot, text_data, visualization, llm, binary_classification, ragas, model_diagnosis, model_comparison, sklearn, nlp, retrieval_performance, tabular_data, time_series_data
text_summarization | qualitative, few_shot, frequency_analysis, embeddings, zero_shot, text_data, visualization, llm, rag_performance, ragas, retrieval_performance, nlp, dimensionality_reduction, tabular_data, time_series_data
data_validation | stationarity, statsmodels, unit_root_test, time_series_data
time_series_forecasting | model_training, data_validation, metadata, visualization, model_explainability, sklearn, model_performance, model_predictions, time_series_data
nlp | data_validation, frequency_analysis, text_data, visualization, nlp
clustering | clustering, model_performance, kmeans, sklearn
residual_analysis | regression
visualization | regression
feature_extraction | embeddings, text_data, visualization, llm
text_qa | qualitative, embeddings, visualization, llm, rag_performance, ragas, dimensionality_reduction, retrieval_performance
text_generation | qualitative, embeddings, visualization, llm, rag_performance, ragas, dimensionality_reduction, retrieval_performance
monitoring | visualization
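The task-to-tags mapping above is exactly what the filtering parameters of `list_tests()` key off. A hedged sketch of narrowing the catalog by task and tags (assuming `pretty=False` returns plain test IDs, per the library documentation; the exact rows returned depend on your installed version):

```python
from validmind.tests import list_tests

# Narrow the catalog to clustering tests tagged for model performance;
# `task` and `tags` mirror the Task and Tags columns above.
clustering_tests = list_tests(
    task="clustering",
    tags=["sklearn", "model_performance"],
    pretty=False,  # assumed to return a plain list of test IDs
)
print(clustering_tests)
```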
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "list_tasks_and_tags()" + "data": { + "text/plain": [ + "['text_qa',\n", + " 'classification',\n", + " 'data_validation',\n", + " 'text_classification',\n", + " 'feature_extraction',\n", + " 'regression',\n", + " 'visualization',\n", + " 'clustering',\n", + " 'time_series_forecasting',\n", + " 'text_summarization',\n", + " 'nlp',\n", + " 'residual_analysis',\n", + " 'monitoring',\n", + " 'text_generation']" ] - }, + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list_tasks()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use [list_tags()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tags) to view all unique tags used to describe tests in the ValidMind Library.\n", + "\n", + "`Tags` describe what a test applies to and help you filter tests for your use case. Examples include:\n", + "\n", + "- **llm:** Tests that work with Large Language Models.\n", + "- **nlp:** Tests relevant for natural language processing.\n", + "- **binary_classification:** Tests for binary classification tasks.\n", + "- **forecasting:** Tests for forecasting and time-series analysis.\n", + "- **tabular_data:** Tests for tabular data like CSVs and Excel spreadsheets." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Filter tests by tags and task types\n", - "\n", - "While listing all tests is useful, you’ll often want to narrow your search. The [list_tests()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tests) function supports `filter`, `task`, and `tags` parameters to assist in refining your results.\n", - "\n", - "Use the `filter` parameter to find tests that match a specific keyword, such as `sklearn`:" + "data": { + "text/plain": [ + "['senstivity_analysis',\n", + " 'calibration',\n", + " 'clustering',\n", + " 'anomaly_detection',\n", + " 'nlp',\n", + " 'classification_metrics',\n", + " 'dimensionality_reduction',\n", + " 'tabular_data',\n", + " 'time_series_data',\n", + " 'model_predictions',\n", + " 'feature_selection',\n", + " 'correlation',\n", + " 'frequency_analysis',\n", + " 'embeddings',\n", + " 'regression',\n", + " 'llm',\n", + " 'statsmodels',\n", + " 'ragas',\n", + " 'model_performance',\n", + " 'model_validation',\n", + " 'rag_performance',\n", + " 'model_training',\n", + " 'qualitative',\n", + " 'classification',\n", + " 'kmeans',\n", + " 'multiclass_classification',\n", + " 'linear_regression',\n", + " 'data_quality',\n", + " 'text_data',\n", + " 'binary_classification',\n", + " 'threshold_optimization',\n", + " 'stationarity',\n", + " 'bias_and_fairness',\n", + " 'scorecard',\n", + " 'model_explainability',\n", + " 'model_comparison',\n", + " 'numerical_data',\n", + " 'sklearn',\n", + " 'model_selection',\n", + " 'retrieval_performance',\n", + " 'zero_shot',\n", + " 'statistical_test',\n", + " 'descriptive_statistics',\n", + " 'seasonality',\n", + " 'analysis',\n", + " 'data_validation',\n", + " 'data_distribution',\n", + " 'feature_importance',\n", + " 'metadata',\n", + " 'few_shot',\n", + " 'visualization',\n", + " 'credit_risk',\n", + " 'forecasting',\n", + " 'AUC',\n", + " 'logistic_regression',\n", + " 'model_diagnosis',\n", + " 'model_interpretation',\n", + " 'unit_root_test',\n", + " 'categorical_data',\n", + 
" 'data_analysis']" ] - }, + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list_tags()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, to match each task type with its related tags, use the [list_tasks_and_tags()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tasks_and_tags) function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| ID | Name | Description | Has Figure | Has Table | Required Inputs | Params | Tags | Tasks |
|---|---|---|---|---|---|---|---|---|
| validmind.model_validation.ClusterSizeDistribution | Cluster Size Distribution | Assesses the performance of clustering models by comparing the distribution of cluster sizes in model predictions... | True | False | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.TimeSeriesR2SquareBySegments | Time Series R2 Square By Segments | Evaluates the R-Squared values of regression models over specified time segments in time series data to assess... | True | True | ['dataset', 'model'] | {'segments': {'type': None, 'default': None}} | ['model_performance', 'sklearn'] | ['regression', 'time_series_forecasting'] |
| validmind.model_validation.sklearn.AdjustedMutualInformation | Adjusted Mutual Information | Evaluates clustering model performance by measuring mutual information between true and predicted labels, adjusting... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.AdjustedRandIndex | Adjusted Rand Index | Measures the similarity between two data clusters using the Adjusted Rand Index (ARI) metric in clustering machine... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.CalibrationCurve | Calibration Curve | Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... | True | False | ['model', 'dataset'] | {'n_bins': {'type': 'int', 'default': 10}} | ['sklearn', 'model_performance', 'classification'] | ['classification'] |
| validmind.model_validation.sklearn.ClassifierPerformance | Classifier Performance | Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... | False | True | ['dataset', 'model'] | {'average': {'type': 'str', 'default': 'macro'}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ClassifierThresholdOptimization | Classifier Threshold Optimization | Analyzes and visualizes different threshold optimization methods for binary classification models.... | False | True | ['dataset', 'model'] | {'methods': {'type': None, 'default': None}, 'target_recall': {'type': None, 'default': None}} | ['model_validation', 'threshold_optimization', 'classification_metrics'] | ['classification'] |
| validmind.model_validation.sklearn.ClusterCosineSimilarity | Cluster Cosine Similarity | Measures the intra-cluster similarity of a clustering model using cosine similarity.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.ClusterPerformanceMetrics | Cluster Performance Metrics | Evaluates the performance of clustering machine learning models using multiple established metrics.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.CompletenessScore | Completeness Score | Evaluates a clustering model's capacity to categorize instances from a single class into the same cluster.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.ConfusionMatrix | Confusion Matrix | Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... | True | False | ['dataset', 'model'] | {'threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.FeatureImportance | Feature Importance | Compute feature importance scores for a given model and generate a summary table... | False | True | ['dataset', 'model'] | {'num_features': {'type': 'int', 'default': 3}} | ['model_explainability', 'sklearn'] | ['regression', 'time_series_forecasting'] |
| validmind.model_validation.sklearn.FowlkesMallowsScore | Fowlkes Mallows Score | Evaluates the similarity between predicted and actual cluster assignments in a model using the Fowlkes-Mallows... | False | True | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.sklearn.HomogeneityScore | Homogeneity Score | Assesses clustering homogeneity by comparing true and predicted labels, scoring from 0 (heterogeneous) to 1... | False | True | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.sklearn.HyperParametersTuning | Hyper Parameters Tuning | Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... | False | True | ['model', 'dataset'] | {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} | ['sklearn', 'model_performance'] | ['clustering', 'classification'] |
| validmind.model_validation.sklearn.KMeansClustersOptimization | K Means Clusters Optimization | Optimizes the number of clusters in K-means models using Elbow and Silhouette methods.... | True | False | ['model', 'dataset'] | {'n_clusters': {'type': None, 'default': None}} | ['sklearn', 'model_performance', 'kmeans'] | ['clustering'] |
| validmind.model_validation.sklearn.MinimumAccuracy | Minimum Accuracy | Checks if the model's prediction accuracy meets or surpasses a specified threshold.... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.7}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.MinimumF1Score | Minimum F1 Score | Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.MinimumROCAUCScore | Minimum ROCAUC Score | Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ModelParameters | Model Parameters | Extracts and displays model parameters in a structured format for transparency and reproducibility.... | False | True | ['model'] | {'model_params': {'type': None, 'default': None}} | ['model_training', 'metadata'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.ModelsPerformanceComparison | Models Performance Comparison | Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... | False | True | ['dataset', 'models'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.OverfitDiagnosis | Overfit Diagnosis | Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... | True | True | ['model', 'datasets'] | {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.PermutationFeatureImportance | Permutation Feature Importance | Assesses the significance of each feature in a model by evaluating the impact on model performance when feature... | True | False | ['model', 'dataset'] | {'fontsize': {'type': None, 'default': None}, 'figure_height': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PopulationStabilityIndex | Population Stability Index | Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... | True | True | ['datasets', 'model'] | {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PrecisionRecallCurve | Precision Recall Curve | Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ROCCurve | ROC Curve | Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.RegressionErrors | Regression Errors | Assesses the performance and error distribution of a regression model using various error metrics.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance'] | ['regression', 'classification'] |
| validmind.model_validation.sklearn.RegressionErrorsComparison | Regression Errors Comparison | Assesses multiple regression error metrics to compare model performance across different datasets, emphasizing... | False | True | ['datasets', 'models'] | {} | ['model_performance', 'sklearn'] | ['regression', 'time_series_forecasting'] |
| validmind.model_validation.sklearn.RegressionPerformance | Regression Performance | Evaluates the performance of a regression model using five different metrics: MAE, MSE, RMSE, MAPE, and MBD.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance'] | ['regression'] |
| validmind.model_validation.sklearn.RegressionR2Square | Regression R2 Square | Assesses the overall goodness-of-fit of a regression model by evaluating R-squared (R2) and Adjusted R-squared (Adj... | False | True | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['regression'] |
| validmind.model_validation.sklearn.RegressionR2SquareComparison | Regression R2 Square Comparison | Compares R-Squared and Adjusted R-Squared values for different regression models across multiple datasets to assess... | False | True | ['datasets', 'models'] | {} | ['model_performance', 'sklearn'] | ['regression', 'time_series_forecasting'] |
| validmind.model_validation.sklearn.RobustnessDiagnosis | Robustness Diagnosis | Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... | True | True | ['datasets', 'model'] | {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': None, 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} | ['sklearn', 'model_diagnosis', 'visualization'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.SHAPGlobalImportance | SHAP Global Importance | Evaluates and visualizes global feature importance using SHAP values for model explanation and risk identification.... | False | True | ['model', 'dataset'] | {'kernel_explainer_samples': {'type': 'int', 'default': 10}, 'tree_or_linear_explainer_samples': {'type': 'int', 'default': 200}, 'class_of_interest': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ScoreProbabilityAlignment | Score Probability Alignment | Analyzes the alignment between credit scores and predicted probabilities.... | True | True | ['model', 'dataset'] | {'score_column': {'type': 'str', 'default': 'score'}, 'n_bins': {'type': 'int', 'default': 10}} | ['visualization', 'credit_risk', 'calibration'] | ['classification'] |
| validmind.model_validation.sklearn.SilhouettePlot | Silhouette Plot | Calculates and visualizes Silhouette Score, assessing the degree of data point suitability to its cluster in ML... | True | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.sklearn.TrainingTestDegradation | Training Test Degradation | Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... | False | True | ['datasets', 'model'] | {'max_threshold': {'type': 'float', 'default': 0.1}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.VMeasure | V Measure | Evaluates homogeneity and completeness of a clustering model using the V Measure Score.... | False | True | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.sklearn.WeakspotsDiagnosis | Weakspots Diagnosis | Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... | True | True | ['datasets', 'model'] | {'features_columns': {'type': None, 'default': None}, 'metrics': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.CalibrationCurveDrift | Calibration Curve Drift | Evaluates changes in probability calibration between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ClassDiscriminationDrift | Class Discrimination Drift | Compares classification discrimination metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ClassificationAccuracyDrift | Classification Accuracy Drift | Compares classification accuracy metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ConfusionMatrixDrift | Confusion Matrix Drift | Compares confusion matrix metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ROCCurveDrift | ROC Curve Drift | Compares ROC curves between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| Task | Tags |
|---|---|
| regression | senstivity_analysis, tabular_data, time_series_data, model_predictions, feature_selection, correlation, regression, statsmodels, model_performance, model_training, multiclass_classification, linear_regression, data_quality, text_data, model_explainability, binary_classification, stationarity, bias_and_fairness, numerical_data, sklearn, model_selection, statistical_test, descriptive_statistics, seasonality, analysis, data_validation, data_distribution, metadata, feature_importance, visualization, forecasting, model_diagnosis, model_interpretation, unit_root_test, categorical_data, data_analysis |
| classification | calibration, anomaly_detection, classification_metrics, tabular_data, time_series_data, feature_selection, correlation, statsmodels, model_performance, model_validation, model_training, classification, multiclass_classification, linear_regression, data_quality, text_data, binary_classification, threshold_optimization, bias_and_fairness, scorecard, model_comparison, numerical_data, sklearn, statistical_test, descriptive_statistics, feature_importance, data_distribution, metadata, visualization, credit_risk, AUC, logistic_regression, model_diagnosis, categorical_data, data_analysis |
| text_classification | model_performance, feature_importance, multiclass_classification, few_shot, frequency_analysis, zero_shot, text_data, visualization, llm, binary_classification, ragas, model_diagnosis, model_comparison, sklearn, nlp, retrieval_performance, tabular_data, time_series_data |
| text_summarization | qualitative, few_shot, frequency_analysis, embeddings, zero_shot, text_data, visualization, llm, rag_performance, ragas, retrieval_performance, nlp, dimensionality_reduction, tabular_data, time_series_data |
| data_validation | stationarity, statsmodels, unit_root_test, time_series_data |
| time_series_forecasting | model_training, data_validation, metadata, visualization, model_explainability, sklearn, model_performance, model_predictions, time_series_data |
| nlp | data_validation, frequency_analysis, text_data, visualization, nlp |
| clustering | clustering, model_performance, kmeans, sklearn |
| residual_analysis | regression |
| visualization | regression |
| feature_extraction | embeddings, text_data, visualization, llm |
| text_qa | qualitative, embeddings, visualization, llm, rag_performance, ragas, dimensionality_reduction, retrieval_performance |
| text_generation | qualitative, embeddings, visualization, llm, rag_performance, ragas, dimensionality_reduction, retrieval_performance |
| monitoring | visualization |
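The task-to-tags mapping above pairs naturally with test filtering. As a minimal sketch (the `task` and `tags` parameter names come from the `list_tests()` documentation referenced in this notebook; the specific values are taken from the mapping above), the two parameters can be combined to narrow a listing:

```python
from validmind.tests import list_tests

# Combine the `task` and `tags` parameters to narrow the listing to
# classification tests tagged as model-performance tests. Both values
# appear in the task/tags mapping rendered above.
list_tests(task="classification", tags=["model_performance"])
```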
\n" ], - "source": [ - "list_tests(filter=\"sklearn\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use the `task` parameter to find tests that match a specific task type, such as `classification`:" + "text/plain": [ + "" ] - }, + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list_tasks_and_tags()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Filter tests by tags and task types\n", + "\n", + "While listing all tests is useful, you’ll often want to narrow your search. The [list_tests()](https://docs.validmind.ai/validmind/validmind/tests.html#list_tests) function supports `filter`, `task`, and `tags` parameters to assist in refining your results.\n", + "\n", + "Use the `filter` parameter to find tests that match a specific keyword, such as `sklearn`:" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", 
- " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", 
- " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| ID | Name | Description | Has Figure | Has Table | Required Inputs | Params | Tags | Tasks |
|---|---|---|---|---|---|---|---|---|
| validmind.data_validation.BivariateScatterPlots | Bivariate Scatter Plots | Generates bivariate scatterplots to visually inspect relationships between pairs of numerical predictor variables... | True | False | ['dataset'] | {} | ['tabular_data', 'numerical_data', 'visualization'] | ['classification'] |
| validmind.data_validation.ChiSquaredFeaturesTable | Chi Squared Features Table | Assesses the statistical association between categorical features and a target variable using the Chi-Squared test.... | False | True | ['dataset'] | {'p_threshold': {'type': '_empty', 'default': 0.05}} | ['tabular_data', 'categorical_data', 'statistical_test'] | ['classification'] |
| validmind.data_validation.ClassImbalance | Class Imbalance | Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model.... | True | True | ['dataset'] | {'min_percent_threshold': {'type': 'int', 'default': 10}} | ['tabular_data', 'binary_classification', 'multiclass_classification', 'data_quality'] | ['classification'] |
| validmind.data_validation.DatasetDescription | Dataset Description | Provides comprehensive analysis and statistical summaries of each column in a machine learning model's dataset.... | False | True | ['dataset'] | {} | ['tabular_data', 'time_series_data', 'text_data'] | ['classification', 'regression', 'text_classification', 'text_summarization'] |
| validmind.data_validation.DatasetSplit | Dataset Split | Evaluates and visualizes the distribution proportions among training, testing, and validation datasets of an ML... | False | True | ['datasets'] | {} | ['tabular_data', 'time_series_data', 'text_data'] | ['classification', 'regression', 'text_classification', 'text_summarization'] |
| validmind.data_validation.DescriptiveStatistics | Descriptive Statistics | Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's... | False | True | ['dataset'] | {} | ['tabular_data', 'time_series_data', 'data_quality'] | ['classification', 'regression'] |
| validmind.data_validation.Duplicates | Duplicates | Tests dataset for duplicate entries, ensuring model reliability via data quality verification.... | False | True | ['dataset'] | {'min_threshold': {'type': '_empty', 'default': 1}} | ['tabular_data', 'data_quality', 'text_data'] | ['classification', 'regression'] |
| validmind.data_validation.FeatureTargetCorrelationPlot | Feature Target Correlation Plot | Visualizes the correlation between input features and the model's target output in a color-coded horizontal bar... | True | False | ['dataset'] | {'fig_height': {'type': '_empty', 'default': 600}} | ['tabular_data', 'visualization', 'correlation'] | ['classification', 'regression'] |
| validmind.data_validation.HighCardinality | High Cardinality | Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting.... | False | True | ['dataset'] | {'num_threshold': {'type': 'int', 'default': 100}, 'percent_threshold': {'type': 'float', 'default': 0.1}, 'threshold_type': {'type': 'str', 'default': 'percent'}} | ['tabular_data', 'data_quality', 'categorical_data'] | ['classification', 'regression'] |
| validmind.data_validation.HighPearsonCorrelation | High Pearson Correlation | Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity.... | False | True | ['dataset'] | {'max_threshold': {'type': 'float', 'default': 0.3}, 'top_n_correlations': {'type': 'int', 'default': 10}, 'feature_columns': {'type': 'list', 'default': None}} | ['tabular_data', 'data_quality', 'correlation'] | ['classification', 'regression'] |
| validmind.data_validation.IQROutliersBarPlot | IQR Outliers Bar Plot | Visualizes outlier distribution across percentiles in numerical data using the Interquartile Range (IQR) method.... | True | False | ['dataset'] | {'threshold': {'type': 'float', 'default': 1.5}, 'fig_width': {'type': 'int', 'default': 800}} | ['tabular_data', 'visualization', 'numerical_data'] | ['classification', 'regression'] |
| validmind.data_validation.IQROutliersTable | IQR Outliers Table | Determines and summarizes outliers in numerical features using the Interquartile Range method.... | False | True | ['dataset'] | {'threshold': {'type': 'float', 'default': 1.5}} | ['tabular_data', 'numerical_data'] | ['classification', 'regression'] |
| validmind.data_validation.IsolationForestOutliers | Isolation Forest Outliers | Detects outliers in a dataset using the Isolation Forest algorithm and visualizes results through scatter plots.... | True | False | ['dataset'] | {'random_state': {'type': 'int', 'default': 0}, 'contamination': {'type': 'float', 'default': 0.1}, 'feature_columns': {'type': 'list', 'default': None}} | ['tabular_data', 'anomaly_detection'] | ['classification'] |
| validmind.data_validation.JarqueBera | Jarque Bera | Assesses normality of dataset features in an ML model using the Jarque-Bera test.... | False | True | ['dataset'] | {} | ['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels'] | ['classification', 'regression'] |
| validmind.data_validation.MissingValues | Missing Values | Evaluates dataset quality by ensuring missing value ratio across all features does not exceed a set threshold.... | False | True | ['dataset'] | {'min_threshold': {'type': 'int', 'default': 1}} | ['tabular_data', 'data_quality'] | ['classification', 'regression'] |
| validmind.data_validation.MissingValuesBarPlot | Missing Values Bar Plot | Assesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on... | True | False | ['dataset'] | {'threshold': {'type': 'int', 'default': 80}, 'fig_height': {'type': 'int', 'default': 600}} | ['tabular_data', 'data_quality', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.MutualInformation | Mutual Information | Calculates mutual information scores between features and target variable to evaluate feature relevance.... | True | False | ['dataset'] | {'min_threshold': {'type': 'float', 'default': 0.01}, 'task': {'type': 'str', 'default': 'classification'}} | ['feature_selection', 'data_analysis'] | ['classification', 'regression'] |
| validmind.data_validation.PearsonCorrelationMatrix | Pearson Correlation Matrix | Evaluates linear dependency between numerical variables in a dataset via a Pearson Correlation coefficient heat map.... | True | False | ['dataset'] | {} | ['tabular_data', 'numerical_data', 'correlation'] | ['classification', 'regression'] |
| validmind.data_validation.ProtectedClassesDescription | Protected Classes Description | Visualizes the distribution of protected classes in the dataset relative to the target variable... | True | True | ['dataset'] | {'protected_classes': {'type': '_empty', 'default': None}} | ['bias_and_fairness', 'descriptive_statistics'] | ['classification', 'regression'] |
| validmind.data_validation.RunsTest | Runs Test | Executes Runs Test on ML model to detect non-random patterns in output data sequence.... | False | True | ['dataset'] | {} | ['tabular_data', 'statistical_test', 'statsmodels'] | ['classification', 'regression'] |
| validmind.data_validation.ScatterPlot | Scatter Plot | Assesses visual relationships, patterns, and outliers among features in a dataset through scatter plot matrices.... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.ScoreBandDefaultRates | Score Band Default Rates | Analyzes default rates and population distribution across credit score bands.... | False | True | ['dataset', 'model'] | {'score_column': {'type': 'str', 'default': 'score'}, 'score_bands': {'type': 'list', 'default': None}} | ['visualization', 'credit_risk', 'scorecard'] | ['classification'] |
| validmind.data_validation.ShapiroWilk | Shapiro Wilk | Evaluates feature-wise normality of training data using the Shapiro-Wilk test.... | False | True | ['dataset'] | {} | ['tabular_data', 'data_distribution', 'statistical_test'] | ['classification', 'regression'] |
| validmind.data_validation.Skewness | Skewness | Evaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data... | False | True | ['dataset'] | {'max_threshold': {'type': '_empty', 'default': 1}} | ['data_quality', 'tabular_data'] | ['classification', 'regression'] |
| validmind.data_validation.TabularCategoricalBarPlots | Tabular Categorical Bar Plots | Generates and visualizes bar plots for each category in categorical features to evaluate the dataset's composition.... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.TabularDateTimeHistograms | Tabular Date Time Histograms | Generates histograms to provide graphical insight into the distribution of time intervals in a model's datetime... | True | False | ['dataset'] | {} | ['time_series_data', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.TabularDescriptionTables | Tabular Description Tables | Summarizes key descriptive statistics for numerical, categorical, and datetime variables in a dataset.... | False | True | ['dataset'] | {} | ['tabular_data'] | ['classification', 'regression'] |
| validmind.data_validation.TabularNumericalHistograms | Tabular Numerical Histograms | Generates histograms for each numerical feature in a dataset to provide visual insights into data distribution and... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.TargetRateBarPlots | Target Rate Bar Plots | Generates bar plots visualizing the default rates of categorical features for a classification machine learning... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization', 'categorical_data'] | ['classification'] |
| validmind.data_validation.TooManyZeroValues | Too Many Zero Values | Identifies numerical columns in a dataset that contain an excessive number of zero values, defined by a threshold... | False | True | ['dataset'] | {'max_percent_threshold': {'type': 'float', 'default': 0.03}} | ['tabular_data'] | ['regression', 'classification'] |
| validmind.data_validation.UniqueRows | Unique Rows | Verifies the diversity of the dataset by ensuring that the count of unique rows exceeds a prescribed threshold.... | False | True | ['dataset'] | {'min_percent_threshold': {'type': 'float', 'default': 1}} | ['tabular_data'] | ['regression', 'classification'] |
| validmind.data_validation.WOEBinPlots | WOE Bin Plots | Generates visualizations of Weight of Evidence (WoE) and Information Value (IV) for understanding predictive power... | True | False | ['dataset'] | {'breaks_adj': {'type': 'list', 'default': None}, 'fig_height': {'type': 'int', 'default': 600}, 'fig_width': {'type': 'int', 'default': 500}} | ['tabular_data', 'visualization', 'categorical_data'] | ['classification'] |
| validmind.data_validation.WOEBinTable | WOE Bin Table | Assesses the Weight of Evidence (WoE) and Information Value (IV) of each feature to evaluate its predictive power... | False | True | ['dataset'] | {'breaks_adj': {'type': 'list', 'default': None}} | ['tabular_data', 'categorical_data'] | ['classification'] |
| validmind.model_validation.FeaturesAUC | Features AUC | Evaluates the discriminatory power of each individual feature within a binary classification model by calculating... | True | False | ['dataset'] | {'fontsize': {'type': 'int', 'default': 12}, 'figure_height': {'type': 'int', 'default': 500}} | ['feature_importance', 'AUC', 'visualization'] | ['classification'] |
| validmind.model_validation.sklearn.CalibrationCurve | Calibration Curve | Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... | True | False | ['model', 'dataset'] | {'n_bins': {'type': 'int', 'default': 10}} | ['sklearn', 'model_performance', 'classification'] | ['classification'] |
| validmind.model_validation.sklearn.ClassifierPerformance | Classifier Performance | Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... | False | True | ['dataset', 'model'] | {'average': {'type': 'str', 'default': 'macro'}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ClassifierThresholdOptimization | Classifier Threshold Optimization | Analyzes and visualizes different threshold optimization methods for binary classification models.... | False | True | ['dataset', 'model'] | {'methods': {'type': None, 'default': None}, 'target_recall': {'type': None, 'default': None}} | ['model_validation', 'threshold_optimization', 'classification_metrics'] | ['classification'] |
| validmind.model_validation.sklearn.ConfusionMatrix | Confusion Matrix | Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... | True | False | ['dataset', 'model'] | {'threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.HyperParametersTuning | Hyper Parameters Tuning | Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... | False | True | ['model', 'dataset'] | {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} | ['sklearn', 'model_performance'] | ['clustering', 'classification'] |
| validmind.model_validation.sklearn.MinimumAccuracy | Minimum Accuracy | Checks if the model's prediction accuracy meets or surpasses a specified threshold.... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.7}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.MinimumF1Score | Minimum F1 Score | Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.MinimumROCAUCScore | Minimum ROCAUC Score | Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ModelParameters | Model Parameters | Extracts and displays model parameters in a structured format for transparency and reproducibility.... | False | True | ['model'] | {'model_params': {'type': None, 'default': None}} | ['model_training', 'metadata'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.ModelsPerformanceComparison | Models Performance Comparison | Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... | False | True | ['dataset', 'models'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.OverfitDiagnosis | Overfit Diagnosis | Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... | True | True | ['model', 'datasets'] | {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.PermutationFeatureImportance | Permutation Feature Importance | Assesses the significance of each feature in a model by evaluating the impact on model performance when feature... | True | False | ['model', 'dataset'] | {'fontsize': {'type': None, 'default': None}, 'figure_height': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PopulationStabilityIndex | Population Stability Index | Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... | True | True | ['datasets', 'model'] | {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PrecisionRecallCurve | Precision Recall Curve | Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ROCCurve | ROC Curve | Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.RegressionErrors | Regression Errors | Assesses the performance and error distribution of a regression model using various error metrics.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance'] | ['regression', 'classification'] |
| validmind.model_validation.sklearn.RobustnessDiagnosis | Robustness Diagnosis | Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... | True | True | ['datasets', 'model'] | {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': None, 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} | ['sklearn', 'model_diagnosis', 'visualization'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.SHAPGlobalImportance | SHAP Global Importance | Evaluates and visualizes global feature importance using SHAP values for model explanation and risk identification.... | False | True | ['model', 'dataset'] | {'kernel_explainer_samples': {'type': 'int', 'default': 10}, 'tree_or_linear_explainer_samples': {'type': 'int', 'default': 200}, 'class_of_interest': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ScoreProbabilityAlignment | Score Probability Alignment | Analyzes the alignment between credit scores and predicted probabilities.... | True | True | ['model', 'dataset'] | {'score_column': {'type': 'str', 'default': 'score'}, 'n_bins': {'type': 'int', 'default': 10}} | ['visualization', 'credit_risk', 'calibration'] | ['classification'] |
| validmind.model_validation.sklearn.TrainingTestDegradation | Training Test Degradation | Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... | False | True | ['datasets', 'model'] | {'max_threshold': {'type': 'float', 'default': 0.1}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.WeakspotsDiagnosis | Weakspots Diagnosis | Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... | True | True | ['datasets', 'model'] | {'features_columns': {'type': None, 'default': None}, 'metrics': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.statsmodels.CumulativePredictionProbabilities | Cumulative Prediction Probabilities | Visualizes cumulative probabilities of positive and negative classes for both training and testing in classification models.... | True | False | ['dataset', 'model'] | {'title': {'type': 'str', 'default': 'Cumulative Probabilities'}} | ['visualization', 'credit_risk'] | ['classification'] |
| validmind.model_validation.statsmodels.GINITable | GINI Table | Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... | False | True | ['dataset', 'model'] | {} | ['model_performance'] | ['classification'] |
| validmind.model_validation.statsmodels.KolmogorovSmirnov | Kolmogorov Smirnov | Assesses whether each feature in the dataset aligns with a normal distribution using the Kolmogorov-Smirnov test.... | False | True | ['model', 'dataset'] | {'dist': {'type': 'str', 'default': 'norm'}} | ['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels'] | ['classification', 'regression'] |
| validmind.model_validation.statsmodels.Lilliefors | Lilliefors | Assesses the normality of feature distributions in an ML model's training dataset using the Lilliefors test.... | False | True | ['dataset'] | {} | ['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels'] | ['classification', 'regression'] |
| validmind.model_validation.statsmodels.PredictionProbabilitiesHistogram | Prediction Probabilities Histogram | Assesses the predictive probability distribution for binary classification to evaluate model performance and... | True | False | ['dataset', 'model'] | {'title': {'type': 'str', 'default': 'Histogram of Predictive Probabilities'}} | ['visualization', 'credit_risk'] | ['classification'] |
| validmind.model_validation.statsmodels.ScorecardHistogram | Scorecard Histogram | The Scorecard Histogram test evaluates the distribution of credit scores between default and non-default instances,... | True | False | ['dataset'] | {'title': {'type': 'str', 'default': 'Histogram of Scores'}, 'score_column': {'type': 'str', 'default': 'score'}} | ['visualization', 'credit_risk', 'logistic_regression'] | ['classification'] |
| validmind.ongoing_monitoring.CalibrationCurveDrift | Calibration Curve Drift | Evaluates changes in probability calibration between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ClassDiscriminationDrift | Class Discrimination Drift | Compares classification discrimination metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ClassImbalanceDrift | Class Imbalance Drift | Evaluates drift in class distribution between reference and monitoring datasets.... | True | True | ['datasets'] | {'drift_pct_threshold': {'type': 'float', 'default': 5.0}, 'title': {'type': 'str', 'default': 'Class Distribution Drift'}} | ['tabular_data', 'binary_classification', 'multiclass_classification'] | ['classification'] |
| validmind.ongoing_monitoring.ClassificationAccuracyDrift | Classification Accuracy Drift | Compares classification accuracy metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ConfusionMatrixDrift | Confusion Matrix Drift | Compares confusion matrix metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.CumulativePredictionProbabilitiesDrift | Cumulative Prediction Probabilities Drift | Compares cumulative prediction probability distributions between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['visualization', 'credit_risk'] | ['classification'] |
| validmind.ongoing_monitoring.PredictionProbabilitiesHistogramDrift | Prediction Probabilities Histogram Drift | Compares prediction probability distributions between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'title': {'type': '_empty', 'default': 'Prediction Probabilities Histogram Drift'}, 'drift_pct_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk'] | ['classification'] |
| validmind.ongoing_monitoring.ROCCurveDrift | ROC Curve Drift | Compares ROC curves between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ScoreBandsDrift | Score Bands Drift | Analyzes drift in population distribution and default rates across score bands.... | False | True | ['datasets', 'model'] | {'score_column': {'type': 'str', 'default': 'score'}, 'score_bands': {'type': 'list', 'default': None}, 'drift_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk', 'scorecard'] | ['classification'] |
| validmind.ongoing_monitoring.ScorecardHistogramDrift | Scorecard Histogram Drift | Compares score distributions between reference and monitoring datasets for each class.... | True | True | ['datasets'] | {'score_column': {'type': 'str', 'default': 'score'}, 'title': {'type': 'str', 'default': 'Scorecard Histogram Drift'}, 'drift_pct_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk', 'logistic_regression'] | ['classification'] |
| validmind.unit_metrics.classification.Accuracy | Accuracy | Calculates the accuracy of a model | False | False | ['dataset', 'model'] | {} | ['classification'] | ['classification'] |
| validmind.unit_metrics.classification.F1 | F1 | Calculates the F1 score for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification'] |
| validmind.unit_metrics.classification.Precision | Precision | Calculates the precision for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification'] |
| validmind.unit_metrics.classification.ROC_AUC | ROC AUC | Calculates the ROC AUC for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification'] |
| validmind.unit_metrics.classification.Recall | Recall | Calculates the recall for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification'] |
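Any ID in the listing above can be passed straight to the test runner. As a minimal sketch (assuming a `VMDataset` named `vm_train_ds` has been initialized earlier in this notebook; `run_test()` and `.log()` follow the patterns used elsewhere in the ValidMind documentation):

```python
from validmind.tests import run_test

# Run one of the listed data-validation tests by its ID, feeding it the
# VMDataset initialized earlier in this notebook (the name is assumed here).
result = run_test(
    "validmind.data_validation.ClassImbalance",
    inputs={"dataset": vm_train_ds},
)

# Optionally log the result to the ValidMind Platform.
result.log()
```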
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| ID | Name | Description | Has Figure | Has Table | Required Inputs | Params | Tags | Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| validmind.model_validation.ClusterSizeDistribution | Cluster Size Distribution | Assesses the performance of clustering models by comparing the distribution of cluster sizes in model predictions... | True | False | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.TimeSeriesR2SquareBySegments | Time Series R2 Square By Segments | Evaluates the R-Squared values of regression models over specified time segments in time series data to assess... | True | True | ['dataset', 'model'] | {'segments': {'type': None, 'default': None}} | ['model_performance', 'sklearn'] | ['regression', 'time_series_forecasting'] |
| validmind.model_validation.sklearn.AdjustedMutualInformation | Adjusted Mutual Information | Evaluates clustering model performance by measuring mutual information between true and predicted labels, adjusting... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.AdjustedRandIndex | Adjusted Rand Index | Measures the similarity between two data clusters using the Adjusted Rand Index (ARI) metric in clustering machine... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.CalibrationCurve | Calibration Curve | Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... | True | False | ['model', 'dataset'] | {'n_bins': {'type': 'int', 'default': 10}} | ['sklearn', 'model_performance', 'classification'] | ['classification'] |
| validmind.model_validation.sklearn.ClassifierPerformance | Classifier Performance | Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... | False | True | ['dataset', 'model'] | {'average': {'type': 'str', 'default': 'macro'}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ClassifierThresholdOptimization | Classifier Threshold Optimization | Analyzes and visualizes different threshold optimization methods for binary classification models.... | False | True | ['dataset', 'model'] | {'methods': {'type': None, 'default': None}, 'target_recall': {'type': None, 'default': None}} | ['model_validation', 'threshold_optimization', 'classification_metrics'] | ['classification'] |
| validmind.model_validation.sklearn.ClusterCosineSimilarity | Cluster Cosine Similarity | Measures the intra-cluster similarity of a clustering model using cosine similarity.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.ClusterPerformanceMetrics | Cluster Performance Metrics | Evaluates the performance of clustering machine learning models using multiple established metrics.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.CompletenessScore | Completeness Score | Evaluates a clustering model's capacity to categorize instances from a single class into the same cluster.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance', 'clustering'] | ['clustering'] |
| validmind.model_validation.sklearn.ConfusionMatrix | Confusion Matrix | Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... | True | False | ['dataset', 'model'] | {'threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.FeatureImportance | Feature Importance | Compute feature importance scores for a given model and generate a summary table... | False | True | ['dataset', 'model'] | {'num_features': {'type': 'int', 'default': 3}} | ['model_explainability', 'sklearn'] | ['regression', 'time_series_forecasting'] |
| validmind.model_validation.sklearn.FowlkesMallowsScore | Fowlkes Mallows Score | Evaluates the similarity between predicted and actual cluster assignments in a model using the Fowlkes-Mallows... | False | True | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.sklearn.HomogeneityScore | Homogeneity Score | Assesses clustering homogeneity by comparing true and predicted labels, scoring from 0 (heterogeneous) to 1... | False | True | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.sklearn.HyperParametersTuning | Hyper Parameters Tuning | Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... | False | True | ['model', 'dataset'] | {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} | ['sklearn', 'model_performance'] | ['clustering', 'classification'] |
| validmind.model_validation.sklearn.KMeansClustersOptimization | K Means Clusters Optimization | Optimizes the number of clusters in K-means models using Elbow and Silhouette methods.... | True | False | ['model', 'dataset'] | {'n_clusters': {'type': None, 'default': None}} | ['sklearn', 'model_performance', 'kmeans'] | ['clustering'] |
| validmind.model_validation.sklearn.MinimumAccuracy | Minimum Accuracy | Checks if the model's prediction accuracy meets or surpasses a specified threshold.... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.7}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.MinimumF1Score | Minimum F1 Score | Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.MinimumROCAUCScore | Minimum ROCAUC Score | Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ModelParameters | Model Parameters | Extracts and displays model parameters in a structured format for transparency and reproducibility.... | False | True | ['model'] | {'model_params': {'type': None, 'default': None}} | ['model_training', 'metadata'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.ModelsPerformanceComparison | Models Performance Comparison | Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... | False | True | ['dataset', 'models'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.OverfitDiagnosis | Overfit Diagnosis | Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... | True | True | ['model', 'datasets'] | {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.PermutationFeatureImportance | Permutation Feature Importance | Assesses the significance of each feature in a model by evaluating the impact on model performance when feature... | True | False | ['model', 'dataset'] | {'fontsize': {'type': None, 'default': None}, 'figure_height': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PopulationStabilityIndex | Population Stability Index | Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... | True | True | ['datasets', 'model'] | {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PrecisionRecallCurve | Precision Recall Curve | Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ROCCurve | ROC Curve | Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.RegressionErrors | Regression Errors | Assesses the performance and error distribution of a regression model using various error metrics.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance'] | ['regression', 'classification'] |
| validmind.model_validation.sklearn.RegressionErrorsComparison | Regression Errors Comparison | Assesses multiple regression error metrics to compare model performance across different datasets, emphasizing... | False | True | ['datasets', 'models'] | {} | ['model_performance', 'sklearn'] | ['regression', 'time_series_forecasting'] |
| validmind.model_validation.sklearn.RegressionPerformance | Regression Performance | Evaluates the performance of a regression model using five different metrics: MAE, MSE, RMSE, MAPE, and MBD.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance'] | ['regression'] |
| validmind.model_validation.sklearn.RegressionR2Square | Regression R2 Square | Assesses the overall goodness-of-fit of a regression model by evaluating R-squared (R2) and Adjusted R-squared (Adj... | False | True | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['regression'] |
| validmind.model_validation.sklearn.RegressionR2SquareComparison | Regression R2 Square Comparison | Compares R-Squared and Adjusted R-Squared values for different regression models across multiple datasets to assess... | False | True | ['datasets', 'models'] | {} | ['model_performance', 'sklearn'] | ['regression', 'time_series_forecasting'] |
| validmind.model_validation.sklearn.RobustnessDiagnosis | Robustness Diagnosis | Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... | True | True | ['datasets', 'model'] | {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': None, 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} | ['sklearn', 'model_diagnosis', 'visualization'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.SHAPGlobalImportance | SHAP Global Importance | Evaluates and visualizes global feature importance using SHAP values for model explanation and risk identification.... | False | True | ['model', 'dataset'] | {'kernel_explainer_samples': {'type': 'int', 'default': 10}, 'tree_or_linear_explainer_samples': {'type': 'int', 'default': 200}, 'class_of_interest': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ScoreProbabilityAlignment | Score Probability Alignment | Analyzes the alignment between credit scores and predicted probabilities.... | True | True | ['model', 'dataset'] | {'score_column': {'type': 'str', 'default': 'score'}, 'n_bins': {'type': 'int', 'default': 10}} | ['visualization', 'credit_risk', 'calibration'] | ['classification'] |
| validmind.model_validation.sklearn.SilhouettePlot | Silhouette Plot | Calculates and visualizes Silhouette Score, assessing the degree of data point suitability to its cluster in ML... | True | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.sklearn.TrainingTestDegradation | Training Test Degradation | Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... | False | True | ['datasets', 'model'] | {'max_threshold': {'type': 'float', 'default': 0.1}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.VMeasure | V Measure | Evaluates homogeneity and completeness of a clustering model using the V Measure Score.... | False | True | ['dataset', 'model'] | {} | ['sklearn', 'model_performance'] | ['clustering'] |
| validmind.model_validation.sklearn.WeakspotsDiagnosis | Weakspots Diagnosis | Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... | True | True | ['datasets', 'model'] | {'features_columns': {'type': None, 'default': None}, 'metrics': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.CalibrationCurveDrift | Calibration Curve Drift | Evaluates changes in probability calibration between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ClassDiscriminationDrift | Class Discrimination Drift | Compares classification discrimination metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ClassificationAccuracyDrift | Classification Accuracy Drift | Compares classification accuracy metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ConfusionMatrixDrift | Confusion Matrix Drift | Compares confusion matrix metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ROCCurveDrift | ROC Curve Drift | Compares ROC curves between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
\n" ], - "source": [ - "list_tests(task=\"classification\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use the `tags` parameter to find tests based on their tags, such as `model_performance` or `visualization`:" + "text/plain": [ + "" ] - }, + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list_tests(filter=\"sklearn\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the `task` parameter to find tests that match a specific task type, such as `classification`:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| ID | Name | Description | Has Figure | Has Table | Required Inputs | Params | Tags | Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| validmind.model_validation.RegressionResidualsPlot | Regression Residuals Plot | Evaluates regression model performance using residual distribution and actual vs. predicted plots.... | True | False | ['model', 'dataset'] | {'bin_size': {'type': 'float', 'default': 0.1}} | ['model_performance', 'visualization'] | ['regression'] |
| validmind.model_validation.sklearn.ConfusionMatrix | Confusion Matrix | Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... | True | False | ['dataset', 'model'] | {'threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PrecisionRecallCurve | Precision Recall Curve | Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ROCCurve | ROC Curve | Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.TrainingTestDegradation | Training Test Degradation | Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... | False | True | ['datasets', 'model'] | {'max_threshold': {'type': 'float', 'default': 0.1}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.CalibrationCurveDrift | Calibration Curve Drift | Evaluates changes in probability calibration between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ROCCurveDrift | ROC Curve Drift | Compares ROC curves between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| ID | Name | Description | Has Figure | Has Table | Required Inputs | Params | Tags | Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| validmind.data_validation.BivariateScatterPlots | Bivariate Scatter Plots | Generates bivariate scatterplots to visually inspect relationships between pairs of numerical predictor variables... | True | False | ['dataset'] | {} | ['tabular_data', 'numerical_data', 'visualization'] | ['classification'] |
| validmind.data_validation.ChiSquaredFeaturesTable | Chi Squared Features Table | Assesses the statistical association between categorical features and a target variable using the Chi-Squared test.... | False | True | ['dataset'] | {'p_threshold': {'type': '_empty', 'default': 0.05}} | ['tabular_data', 'categorical_data', 'statistical_test'] | ['classification'] |
| validmind.data_validation.ClassImbalance | Class Imbalance | Evaluates and quantifies class distribution imbalance in a dataset used by a machine learning model.... | True | True | ['dataset'] | {'min_percent_threshold': {'type': 'int', 'default': 10}} | ['tabular_data', 'binary_classification', 'multiclass_classification', 'data_quality'] | ['classification'] |
| validmind.data_validation.DatasetDescription | Dataset Description | Provides comprehensive analysis and statistical summaries of each column in a machine learning model's dataset.... | False | True | ['dataset'] | {} | ['tabular_data', 'time_series_data', 'text_data'] | ['classification', 'regression', 'text_classification', 'text_summarization'] |
| validmind.data_validation.DatasetSplit | Dataset Split | Evaluates and visualizes the distribution proportions among training, testing, and validation datasets of an ML... | False | True | ['datasets'] | {} | ['tabular_data', 'time_series_data', 'text_data'] | ['classification', 'regression', 'text_classification', 'text_summarization'] |
| validmind.data_validation.DescriptiveStatistics | Descriptive Statistics | Performs a detailed descriptive statistical analysis of both numerical and categorical data within a model's... | False | True | ['dataset'] | {} | ['tabular_data', 'time_series_data', 'data_quality'] | ['classification', 'regression'] |
| validmind.data_validation.Duplicates | Duplicates | Tests dataset for duplicate entries, ensuring model reliability via data quality verification.... | False | True | ['dataset'] | {'min_threshold': {'type': '_empty', 'default': 1}} | ['tabular_data', 'data_quality', 'text_data'] | ['classification', 'regression'] |
| validmind.data_validation.FeatureTargetCorrelationPlot | Feature Target Correlation Plot | Visualizes the correlation between input features and the model's target output in a color-coded horizontal bar... | True | False | ['dataset'] | {'fig_height': {'type': '_empty', 'default': 600}} | ['tabular_data', 'visualization', 'correlation'] | ['classification', 'regression'] |
| validmind.data_validation.HighCardinality | High Cardinality | Assesses the number of unique values in categorical columns to detect high cardinality and potential overfitting.... | False | True | ['dataset'] | {'num_threshold': {'type': 'int', 'default': 100}, 'percent_threshold': {'type': 'float', 'default': 0.1}, 'threshold_type': {'type': 'str', 'default': 'percent'}} | ['tabular_data', 'data_quality', 'categorical_data'] | ['classification', 'regression'] |
| validmind.data_validation.HighPearsonCorrelation | High Pearson Correlation | Identifies highly correlated feature pairs in a dataset suggesting feature redundancy or multicollinearity.... | False | True | ['dataset'] | {'max_threshold': {'type': 'float', 'default': 0.3}, 'top_n_correlations': {'type': 'int', 'default': 10}, 'feature_columns': {'type': 'list', 'default': None}} | ['tabular_data', 'data_quality', 'correlation'] | ['classification', 'regression'] |
| validmind.data_validation.IQROutliersBarPlot | IQR Outliers Bar Plot | Visualizes outlier distribution across percentiles in numerical data using the Interquartile Range (IQR) method.... | True | False | ['dataset'] | {'threshold': {'type': 'float', 'default': 1.5}, 'fig_width': {'type': 'int', 'default': 800}} | ['tabular_data', 'visualization', 'numerical_data'] | ['classification', 'regression'] |
| validmind.data_validation.IQROutliersTable | IQR Outliers Table | Determines and summarizes outliers in numerical features using the Interquartile Range method.... | False | True | ['dataset'] | {'threshold': {'type': 'float', 'default': 1.5}} | ['tabular_data', 'numerical_data'] | ['classification', 'regression'] |
| validmind.data_validation.IsolationForestOutliers | Isolation Forest Outliers | Detects outliers in a dataset using the Isolation Forest algorithm and visualizes results through scatter plots.... | True | False | ['dataset'] | {'random_state': {'type': 'int', 'default': 0}, 'contamination': {'type': 'float', 'default': 0.1}, 'feature_columns': {'type': 'list', 'default': None}} | ['tabular_data', 'anomaly_detection'] | ['classification'] |
| validmind.data_validation.JarqueBera | Jarque Bera | Assesses normality of dataset features in an ML model using the Jarque-Bera test.... | False | True | ['dataset'] | {} | ['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels'] | ['classification', 'regression'] |
| validmind.data_validation.MissingValues | Missing Values | Evaluates dataset quality by ensuring missing value ratio across all features does not exceed a set threshold.... | False | True | ['dataset'] | {'min_threshold': {'type': 'int', 'default': 1}} | ['tabular_data', 'data_quality'] | ['classification', 'regression'] |
| validmind.data_validation.MissingValuesBarPlot | Missing Values Bar Plot | Assesses the percentage and distribution of missing values in the dataset via a bar plot, with emphasis on... | True | False | ['dataset'] | {'threshold': {'type': 'int', 'default': 80}, 'fig_height': {'type': 'int', 'default': 600}} | ['tabular_data', 'data_quality', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.MutualInformation | Mutual Information | Calculates mutual information scores between features and target variable to evaluate feature relevance.... | True | False | ['dataset'] | {'min_threshold': {'type': 'float', 'default': 0.01}, 'task': {'type': 'str', 'default': 'classification'}} | ['feature_selection', 'data_analysis'] | ['classification', 'regression'] |
| validmind.data_validation.PearsonCorrelationMatrix | Pearson Correlation Matrix | Evaluates linear dependency between numerical variables in a dataset via a Pearson Correlation coefficient heat map.... | True | False | ['dataset'] | {} | ['tabular_data', 'numerical_data', 'correlation'] | ['classification', 'regression'] |
| validmind.data_validation.ProtectedClassesDescription | Protected Classes Description | Visualizes the distribution of protected classes in the dataset relative to the target variable... | True | True | ['dataset'] | {'protected_classes': {'type': '_empty', 'default': None}} | ['bias_and_fairness', 'descriptive_statistics'] | ['classification', 'regression'] |
| validmind.data_validation.RunsTest | Runs Test | Executes Runs Test on ML model to detect non-random patterns in output data sequence.... | False | True | ['dataset'] | {} | ['tabular_data', 'statistical_test', 'statsmodels'] | ['classification', 'regression'] |
| validmind.data_validation.ScatterPlot | Scatter Plot | Assesses visual relationships, patterns, and outliers among features in a dataset through scatter plot matrices.... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.ScoreBandDefaultRates | Score Band Default Rates | Analyzes default rates and population distribution across credit score bands.... | False | True | ['dataset', 'model'] | {'score_column': {'type': 'str', 'default': 'score'}, 'score_bands': {'type': 'list', 'default': None}} | ['visualization', 'credit_risk', 'scorecard'] | ['classification'] |
| validmind.data_validation.ShapiroWilk | Shapiro Wilk | Evaluates feature-wise normality of training data using the Shapiro-Wilk test.... | False | True | ['dataset'] | {} | ['tabular_data', 'data_distribution', 'statistical_test'] | ['classification', 'regression'] |
| validmind.data_validation.Skewness | Skewness | Evaluates the skewness of numerical data in a dataset to check against a defined threshold, aiming to ensure data... | False | True | ['dataset'] | {'max_threshold': {'type': '_empty', 'default': 1}} | ['data_quality', 'tabular_data'] | ['classification', 'regression'] |
| validmind.data_validation.TabularCategoricalBarPlots | Tabular Categorical Bar Plots | Generates and visualizes bar plots for each category in categorical features to evaluate the dataset's composition.... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.TabularDateTimeHistograms | Tabular Date Time Histograms | Generates histograms to provide graphical insight into the distribution of time intervals in a model's datetime... | True | False | ['dataset'] | {} | ['time_series_data', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.TabularDescriptionTables | Tabular Description Tables | Summarizes key descriptive statistics for numerical, categorical, and datetime variables in a dataset.... | False | True | ['dataset'] | {} | ['tabular_data'] | ['classification', 'regression'] |
| validmind.data_validation.TabularNumericalHistograms | Tabular Numerical Histograms | Generates histograms for each numerical feature in a dataset to provide visual insights into data distribution and... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization'] | ['classification', 'regression'] |
| validmind.data_validation.TargetRateBarPlots | Target Rate Bar Plots | Generates bar plots visualizing the default rates of categorical features for a classification machine learning... | True | False | ['dataset'] | {} | ['tabular_data', 'visualization', 'categorical_data'] | ['classification'] |
| validmind.data_validation.TooManyZeroValues | Too Many Zero Values | Identifies numerical columns in a dataset that contain an excessive number of zero values, defined by a threshold... | False | True | ['dataset'] | {'max_percent_threshold': {'type': 'float', 'default': 0.03}} | ['tabular_data'] | ['regression', 'classification'] |
| validmind.data_validation.UniqueRows | Unique Rows | Verifies the diversity of the dataset by ensuring that the count of unique rows exceeds a prescribed threshold.... | False | True | ['dataset'] | {'min_percent_threshold': {'type': 'float', 'default': 1}} | ['tabular_data'] | ['regression', 'classification'] |
| validmind.data_validation.WOEBinPlots | WOE Bin Plots | Generates visualizations of Weight of Evidence (WoE) and Information Value (IV) for understanding predictive power... | True | False | ['dataset'] | {'breaks_adj': {'type': 'list', 'default': None}, 'fig_height': {'type': 'int', 'default': 600}, 'fig_width': {'type': 'int', 'default': 500}} | ['tabular_data', 'visualization', 'categorical_data'] | ['classification'] |
| validmind.data_validation.WOEBinTable | WOE Bin Table | Assesses the Weight of Evidence (WoE) and Information Value (IV) of each feature to evaluate its predictive power... | False | True | ['dataset'] | {'breaks_adj': {'type': 'list', 'default': None}} | ['tabular_data', 'categorical_data'] | ['classification'] |
| validmind.model_validation.FeaturesAUC | Features AUC | Evaluates the discriminatory power of each individual feature within a binary classification model by calculating... | True | False | ['dataset'] | {'fontsize': {'type': 'int', 'default': 12}, 'figure_height': {'type': 'int', 'default': 500}} | ['feature_importance', 'AUC', 'visualization'] | ['classification'] |
| validmind.model_validation.sklearn.CalibrationCurve | Calibration Curve | Evaluates the calibration of probability estimates by comparing predicted probabilities against observed... | True | False | ['model', 'dataset'] | {'n_bins': {'type': 'int', 'default': 10}} | ['sklearn', 'model_performance', 'classification'] | ['classification'] |
| validmind.model_validation.sklearn.ClassifierPerformance | Classifier Performance | Evaluates performance of binary or multiclass classification models using precision, recall, F1-Score, accuracy,... | False | True | ['dataset', 'model'] | {'average': {'type': 'str', 'default': 'macro'}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ClassifierThresholdOptimization | Classifier Threshold Optimization | Analyzes and visualizes different threshold optimization methods for binary classification models.... | False | True | ['dataset', 'model'] | {'methods': {'type': None, 'default': None}, 'target_recall': {'type': None, 'default': None}} | ['model_validation', 'threshold_optimization', 'classification_metrics'] | ['classification'] |
| validmind.model_validation.sklearn.ConfusionMatrix | Confusion Matrix | Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... | True | False | ['dataset', 'model'] | {'threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.HyperParametersTuning | Hyper Parameters Tuning | Performs exhaustive grid search over specified parameter ranges to find optimal model configurations... | False | True | ['model', 'dataset'] | {'param_grid': {'type': 'dict', 'default': None}, 'scoring': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}, 'fit_params': {'type': 'dict', 'default': None}} | ['sklearn', 'model_performance'] | ['clustering', 'classification'] |
| validmind.model_validation.sklearn.MinimumAccuracy | Minimum Accuracy | Checks if the model's prediction accuracy meets or surpasses a specified threshold.... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.7}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.MinimumF1Score | Minimum F1 Score | Assesses if the model's F1 score on the validation set meets a predefined minimum threshold, ensuring balanced... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.MinimumROCAUCScore | Minimum ROCAUC Score | Validates model by checking if the ROC AUC score meets or surpasses a specified threshold.... | False | True | ['dataset', 'model'] | {'min_threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ModelParameters | Model Parameters | Extracts and displays model parameters in a structured format for transparency and reproducibility.... | False | True | ['model'] | {'model_params': {'type': None, 'default': None}} | ['model_training', 'metadata'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.ModelsPerformanceComparison | Models Performance Comparison | Evaluates and compares the performance of multiple Machine Learning models using various metrics like accuracy,... | False | True | ['dataset', 'models'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'model_comparison'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.OverfitDiagnosis | Overfit Diagnosis | Assesses potential overfitting in a model's predictions, identifying regions where performance between training and... | True | True | ['model', 'datasets'] | {'metric': {'type': 'str', 'default': None}, 'cut_off_threshold': {'type': 'float', 'default': 0.04}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'linear_regression', 'model_diagnosis'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.PermutationFeatureImportance | Permutation Feature Importance | Assesses the significance of each feature in a model by evaluating the impact on model performance when feature... | True | False | ['model', 'dataset'] | {'fontsize': {'type': None, 'default': None}, 'figure_height': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PopulationStabilityIndex | Population Stability Index | Assesses the Population Stability Index (PSI) to quantify the stability of an ML model's predictions across... | True | True | ['datasets', 'model'] | {'num_bins': {'type': 'int', 'default': 10}, 'mode': {'type': 'str', 'default': 'fixed'}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PrecisionRecallCurve | Precision Recall Curve | Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ROCCurve | ROC Curve | Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.RegressionErrors | Regression Errors | Assesses the performance and error distribution of a regression model using various error metrics.... | False | True | ['model', 'dataset'] | {} | ['sklearn', 'model_performance'] | ['regression', 'classification'] |
| validmind.model_validation.sklearn.RobustnessDiagnosis | Robustness Diagnosis | Assesses the robustness of a machine learning model by evaluating performance decay under noisy conditions.... | True | True | ['datasets', 'model'] | {'metric': {'type': 'str', 'default': None}, 'scaling_factor_std_dev_list': {'type': None, 'default': [0.1, 0.2, 0.3, 0.4, 0.5]}, 'performance_decay_threshold': {'type': 'float', 'default': 0.05}} | ['sklearn', 'model_diagnosis', 'visualization'] | ['classification', 'regression'] |
| validmind.model_validation.sklearn.SHAPGlobalImportance | SHAP Global Importance | Evaluates and visualizes global feature importance using SHAP values for model explanation and risk identification.... | False | True | ['model', 'dataset'] | {'kernel_explainer_samples': {'type': 'int', 'default': 10}, 'tree_or_linear_explainer_samples': {'type': 'int', 'default': 200}, 'class_of_interest': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'feature_importance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ScoreProbabilityAlignment | Score Probability Alignment | Analyzes the alignment between credit scores and predicted probabilities.... | True | True | ['model', 'dataset'] | {'score_column': {'type': 'str', 'default': 'score'}, 'n_bins': {'type': 'int', 'default': 10}} | ['visualization', 'credit_risk', 'calibration'] | ['classification'] |
| validmind.model_validation.sklearn.TrainingTestDegradation | Training Test Degradation | Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... | False | True | ['datasets', 'model'] | {'max_threshold': {'type': 'float', 'default': 0.1}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.WeakspotsDiagnosis | Weakspots Diagnosis | Identifies and visualizes weak spots in a machine learning model's performance across various sections of the... | True | True | ['datasets', 'model'] | {'features_columns': {'type': None, 'default': None}, 'metrics': {'type': None, 'default': None}, 'thresholds': {'type': None, 'default': None}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_diagnosis', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.statsmodels.CumulativePredictionProbabilities | Cumulative Prediction Probabilities | Visualizes cumulative probabilities of positive and negative classes for both training and testing in classification models.... | True | False | ['dataset', 'model'] | {'title': {'type': 'str', 'default': 'Cumulative Probabilities'}} | ['visualization', 'credit_risk'] | ['classification'] |
| validmind.model_validation.statsmodels.GINITable | GINI Table | Evaluates classification model performance using AUC, GINI, and KS metrics for training and test datasets.... | False | True | ['dataset', 'model'] | {} | ['model_performance'] | ['classification'] |
| validmind.model_validation.statsmodels.KolmogorovSmirnov | Kolmogorov Smirnov | Assesses whether each feature in the dataset aligns with a normal distribution using the Kolmogorov-Smirnov test.... | False | True | ['model', 'dataset'] | {'dist': {'type': 'str', 'default': 'norm'}} | ['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels'] | ['classification', 'regression'] |
| validmind.model_validation.statsmodels.Lilliefors | Lilliefors | Assesses the normality of feature distributions in an ML model's training dataset using the Lilliefors test.... | False | True | ['dataset'] | {} | ['tabular_data', 'data_distribution', 'statistical_test', 'statsmodels'] | ['classification', 'regression'] |
| validmind.model_validation.statsmodels.PredictionProbabilitiesHistogram | Prediction Probabilities Histogram | Assesses the predictive probability distribution for binary classification to evaluate model performance and... | True | False | ['dataset', 'model'] | {'title': {'type': 'str', 'default': 'Histogram of Predictive Probabilities'}} | ['visualization', 'credit_risk'] | ['classification'] |
| validmind.model_validation.statsmodels.ScorecardHistogram | Scorecard Histogram | The Scorecard Histogram test evaluates the distribution of credit scores between default and non-default instances,... | True | False | ['dataset'] | {'title': {'type': 'str', 'default': 'Histogram of Scores'}, 'score_column': {'type': 'str', 'default': 'score'}} | ['visualization', 'credit_risk', 'logistic_regression'] | ['classification'] |
| validmind.ongoing_monitoring.CalibrationCurveDrift | Calibration Curve Drift | Evaluates changes in probability calibration between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ClassDiscriminationDrift | Class Discrimination Drift | Compares classification discrimination metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ClassImbalanceDrift | Class Imbalance Drift | Evaluates drift in class distribution between reference and monitoring datasets.... | True | True | ['datasets'] | {'drift_pct_threshold': {'type': 'float', 'default': 5.0}, 'title': {'type': 'str', 'default': 'Class Distribution Drift'}} | ['tabular_data', 'binary_classification', 'multiclass_classification'] | ['classification'] |
| validmind.ongoing_monitoring.ClassificationAccuracyDrift | Classification Accuracy Drift | Compares classification accuracy metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ConfusionMatrixDrift | Confusion Matrix Drift | Compares confusion matrix metrics between reference and monitoring datasets.... | False | True | ['datasets', 'model'] | {'drift_pct_threshold': {'type': '_empty', 'default': 20}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.CumulativePredictionProbabilitiesDrift | Cumulative Prediction Probabilities Drift | Compares cumulative prediction probability distributions between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['visualization', 'credit_risk'] | ['classification'] |
| validmind.ongoing_monitoring.PredictionProbabilitiesHistogramDrift | Prediction Probabilities Histogram Drift | Compares prediction probability distributions between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'title': {'type': '_empty', 'default': 'Prediction Probabilities Histogram Drift'}, 'drift_pct_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk'] | ['classification'] |
| validmind.ongoing_monitoring.ROCCurveDrift | ROC Curve Drift | Compares ROC curves between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ScoreBandsDrift | Score Bands Drift | Analyzes drift in population distribution and default rates across score bands.... | False | True | ['datasets', 'model'] | {'score_column': {'type': 'str', 'default': 'score'}, 'score_bands': {'type': 'list', 'default': None}, 'drift_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk', 'scorecard'] | ['classification'] |
| validmind.ongoing_monitoring.ScorecardHistogramDrift | Scorecard Histogram Drift | Compares score distributions between reference and monitoring datasets for each class.... | True | True | ['datasets'] | {'score_column': {'type': 'str', 'default': 'score'}, 'title': {'type': 'str', 'default': 'Scorecard Histogram Drift'}, 'drift_pct_threshold': {'type': 'float', 'default': 20.0}} | ['visualization', 'credit_risk', 'logistic_regression'] | ['classification'] |
| validmind.unit_metrics.classification.Accuracy | Accuracy | Calculates the accuracy of a model | False | False | ['dataset', 'model'] | {} | ['classification'] | ['classification'] |
| validmind.unit_metrics.classification.F1 | F1 | Calculates the F1 score for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification'] |
| validmind.unit_metrics.classification.Precision | Precision | Calculates the precision for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification'] |
| validmind.unit_metrics.classification.ROC_AUC | ROC AUC | Calculates the ROC AUC for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification'] |
| validmind.unit_metrics.classification.Recall | Recall | Calculates the recall for a classification model. | False | False | ['model', 'dataset'] | {} | ['classification'] | ['classification'] |
\n" ], - "source": [ - "list_tests(tags=[\"model_performance\", \"visualization\"])" + "text/plain": [ + "" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use `filter`, `task`, and `tags` together to create more specific queries.\n", - "\n", - "For example, apply all three to find tests compatible with `sklearn` models, designed for `classification` tasks:" - ] - }, + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list_tests(task=\"classification\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use the `tags` parameter to find tests based on their tags, such as `model_performance` or `visualization`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| ID | Name | Description | Has Figure | Has Table | Required Inputs | Params | Tags | Tasks |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| validmind.model_validation.sklearn.ConfusionMatrix | Confusion Matrix | Evaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix... | True | False | ['dataset', 'model'] | {'threshold': {'type': 'float', 'default': 0.5}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.PrecisionRecallCurve | Precision Recall Curve | Evaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve.... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.ROCCurve | ROC Curve | Evaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic... | True | False | ['model', 'dataset'] | {} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.model_validation.sklearn.TrainingTestDegradation | Training Test Degradation | Tests if model performance degradation between training and test datasets exceeds a predefined threshold.... | False | True | ['datasets', 'model'] | {'max_threshold': {'type': 'float', 'default': 0.1}} | ['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.CalibrationCurveDrift | Calibration Curve Drift | Evaluates changes in probability calibration between reference and monitoring datasets.... | True | True | ['datasets', 'model'] | {'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
| validmind.ongoing_monitoring.ROCCurveDrift | ROC Curve Drift | Compares ROC curves between reference and monitoring datasets.... | True | False | ['datasets', 'model'] | {} | ['sklearn', 'binary_classification', 'model_performance', 'visualization'] | ['classification', 'text_classification'] |
\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IDNameDescriptionHas FigureHas TableRequired InputsParamsTagsTasks
validmind.model_validation.RegressionResidualsPlotRegression Residuals PlotEvaluates regression model performance using residual distribution and actual vs. predicted plots....TrueFalse['model', 'dataset']{'bin_size': {'type': 'float', 'default': 0.1}}['model_performance', 'visualization']['regression']
validmind.model_validation.sklearn.ConfusionMatrixConfusion MatrixEvaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...TrueFalse['dataset', 'model']{'threshold': {'type': 'float', 'default': 0.5}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurvePrecision Recall CurveEvaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve....TrueFalse['model', 'dataset']{}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurveROC CurveEvaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic...TrueFalse['model', 'dataset']{}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.TrainingTestDegradationTraining Test DegradationTests if model performance degradation between training and test datasets exceeds a predefined threshold....FalseTrue['datasets', 'model']{'max_threshold': {'type': 'float', 'default': 0.1}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.ongoing_monitoring.CalibrationCurveDriftCalibration Curve DriftEvaluates changes in probability calibration between reference and monitoring datasets....TrueTrue['datasets', 'model']{'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDriftROC Curve DriftCompares ROC curves between reference and monitoring datasets....TrueFalse['datasets', 'model']{}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
\n" ], - "source": [ - "list_tests(filter=\"sklearn\",\n", - " tags=[\"model_performance\", \"visualization\"], task=\"classification\"\n", - ")" + "text/plain": [ + "" ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Store test sets for use\n", - "\n", - "Once you've identified specific sets of tests you'd like to run, you can store the tests in variables, enabling you to easily reuse those tests in later steps.\n", - "\n", - "For example, if you're validating a summarization model, use [`list_tests()`](https://docs.validmind.ai/validmind/validmind/tests.html#list_tests) to retrieve all tests tagged for text summarization and save them to `text_summarization_tests` for later use:" - ] - }, + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list_tests(tags=[\"model_performance\", \"visualization\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Use `filter`, `task`, and `tags` together to create more specific queries.\n", + "\n", + "For example, apply all three to find tests compatible with `sklearn` models, designed for `classification` tasks:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['validmind.data_validation.DatasetDescription',\n", - " 'validmind.data_validation.DatasetSplit',\n", - " 'validmind.data_validation.nlp.CommonWords',\n", - " 'validmind.data_validation.nlp.Hashtags',\n", - " 'validmind.data_validation.nlp.LanguageDetection',\n", - " 'validmind.data_validation.nlp.Mentions',\n", - " 'validmind.data_validation.nlp.Punctuations',\n", - " 'validmind.data_validation.nlp.StopWords',\n", - " 'validmind.data_validation.nlp.TextDescription',\n", - " 'validmind.model_validation.BertScore',\n", - " 'validmind.model_validation.BleuScore',\n", - " 'validmind.model_validation.ContextualRecall',\n", - " 'validmind.model_validation.MeteorScore',\n", - " 'validmind.model_validation.RegardScore',\n", - " 'validmind.model_validation.RougeScore',\n", - " 'validmind.model_validation.TokenDisparity',\n", - " 'validmind.model_validation.ToxicityScore',\n", - " 'validmind.model_validation.embeddings.CosineSimilarityComparison',\n", - " 'validmind.model_validation.embeddings.CosineSimilarityHeatmap',\n", - " 'validmind.model_validation.embeddings.EuclideanDistanceComparison',\n", - " 'validmind.model_validation.embeddings.EuclideanDistanceHeatmap',\n", - " 'validmind.model_validation.embeddings.PCAComponentsPairwisePlots',\n", - " 'validmind.model_validation.embeddings.TSNEComponentsPairwisePlots',\n", - " 'validmind.model_validation.ragas.AnswerCorrectness',\n", - " 'validmind.model_validation.ragas.AspectCritic',\n", - " 'validmind.model_validation.ragas.ContextEntityRecall',\n", - " 'validmind.model_validation.ragas.ContextPrecision',\n", - " 'validmind.model_validation.ragas.ContextPrecisionWithoutReference',\n", - " 'validmind.model_validation.ragas.ContextRecall',\n", - " 'validmind.model_validation.ragas.Faithfulness',\n", - " 'validmind.model_validation.ragas.NoiseSensitivity',\n", - " 'validmind.model_validation.ragas.ResponseRelevancy',\n", - " 'validmind.model_validation.ragas.SemanticSimilarity',\n", - " 'validmind.prompt_validation.Bias',\n", - " 'validmind.prompt_validation.Clarity',\n", - " 'validmind.prompt_validation.Conciseness',\n", - " 
'validmind.prompt_validation.Delimitation',\n", - " 'validmind.prompt_validation.NegativeInstruction',\n", - " 'validmind.prompt_validation.Robustness',\n", - " 'validmind.prompt_validation.Specificity']" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } + "data": { + "text/html": [ + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
IDNameDescriptionHas FigureHas TableRequired InputsParamsTagsTasks
validmind.model_validation.sklearn.ConfusionMatrixConfusion MatrixEvaluates and visually represents the classification ML model's predictive performance using a Confusion Matrix...TrueFalse['dataset', 'model']{'threshold': {'type': 'float', 'default': 0.5}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.PrecisionRecallCurvePrecision Recall CurveEvaluates the precision-recall trade-off for binary classification models and visualizes the Precision-Recall curve....TrueFalse['model', 'dataset']{}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.ROCCurveROC CurveEvaluates binary classification model performance by generating and plotting the Receiver Operating Characteristic...TrueFalse['model', 'dataset']{}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.model_validation.sklearn.TrainingTestDegradationTraining Test DegradationTests if model performance degradation between training and test datasets exceeds a predefined threshold....FalseTrue['datasets', 'model']{'max_threshold': {'type': 'float', 'default': 0.1}}['sklearn', 'binary_classification', 'multiclass_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.ongoing_monitoring.CalibrationCurveDriftCalibration Curve DriftEvaluates changes in probability calibration between reference and monitoring datasets....TrueTrue['datasets', 'model']{'n_bins': {'type': 'int', 'default': 10}, 'drift_pct_threshold': {'type': 'float', 'default': 20}}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
validmind.ongoing_monitoring.ROCCurveDriftROC Curve DriftCompares ROC curves between reference and monitoring datasets....TrueFalse['datasets', 'model']{}['sklearn', 'binary_classification', 'model_performance', 'visualization']['classification', 'text_classification']
\n" ], - "source": [ - "text_summarization_tests = list_tests(task=\"text_summarization\", pretty=False)\n", - "text_summarization_tests" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Next steps\n", - "\n", - "Now that you know how to browse and filter tests in the ValidMind Library, you’re ready to take the next step. Use the test IDs you’ve selected to either run individual tests or batch run them with custom test suites.\n", - "\n", - "
Learn about the test suites available in the ValidMind Library.\n", - "

\n", - "Check out our Explore test suites notebook for more code examples and usage of key functions.
\n", - "\n", - "\n", - "\n", - "### Discover more learning resources\n", - "\n", - "We offer many interactive notebooks to help you automate testing, documenting, validating, and more:\n", - "\n", - "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", - "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", - "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", - "\n", - "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Upgrade ValidMind\n", - "\n", - "
After installing ValidMind, you'll want to periodically make sure you are on the latest version to access any new features and other enhancements.
\n", - "\n", - "Retrieve the information for the currently installed version of ValidMind:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip show validmind" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", - "\n", - "```bash\n", - "%pip install --upgrade validmind\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You may need to restart your kernel after running the upgrade package for changes to be applied." + "text/plain": [ + "" ] - }, + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "list_tests(filter=\"sklearn\",\n", + " tags=[\"model_performance\", \"visualization\"], task=\"classification\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Store test sets for use\n", + "\n", + "Once you've identified specific sets of tests you'd like to run, you can store the tests in variables, enabling you to easily reuse those tests in later steps.\n", + "\n", + "For example, if you're validating a summarization model, use [`list_tests()`](https://docs.validmind.ai/validmind/validmind/tests.html#list_tests) to retrieve all tests tagged for text summarization and save them to `text_summarization_tests` for later use:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - "\n", - "\n", - "\n", - "***\n", - "\n", - "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", - "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", - "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" + "data": { + "text/plain": [ + "['validmind.data_validation.DatasetDescription',\n", + " 'validmind.data_validation.DatasetSplit',\n", + " 'validmind.data_validation.nlp.CommonWords',\n", + " 'validmind.data_validation.nlp.Hashtags',\n", + " 'validmind.data_validation.nlp.LanguageDetection',\n", + " 'validmind.data_validation.nlp.Mentions',\n", + " 'validmind.data_validation.nlp.Punctuations',\n", + " 'validmind.data_validation.nlp.StopWords',\n", + " 'validmind.data_validation.nlp.TextDescription',\n", + " 'validmind.model_validation.BertScore',\n", + " 'validmind.model_validation.BleuScore',\n", + " 'validmind.model_validation.ContextualRecall',\n", + " 'validmind.model_validation.MeteorScore',\n", + " 'validmind.model_validation.RegardScore',\n", + " 'validmind.model_validation.RougeScore',\n", + " 'validmind.model_validation.TokenDisparity',\n", + " 'validmind.model_validation.ToxicityScore',\n", + " 'validmind.model_validation.embeddings.CosineSimilarityComparison',\n", + " 'validmind.model_validation.embeddings.CosineSimilarityHeatmap',\n", + " 'validmind.model_validation.embeddings.EuclideanDistanceComparison',\n", + " 'validmind.model_validation.embeddings.EuclideanDistanceHeatmap',\n", + " 'validmind.model_validation.embeddings.PCAComponentsPairwisePlots',\n", + " 'validmind.model_validation.embeddings.TSNEComponentsPairwisePlots',\n", + " 'validmind.model_validation.ragas.AnswerCorrectness',\n", + " 'validmind.model_validation.ragas.AspectCritic',\n", + " 'validmind.model_validation.ragas.ContextEntityRecall',\n", + " 'validmind.model_validation.ragas.ContextPrecision',\n", + " 'validmind.model_validation.ragas.ContextPrecisionWithoutReference',\n", + " 'validmind.model_validation.ragas.ContextRecall',\n", + " 'validmind.model_validation.ragas.Faithfulness',\n", + " 'validmind.model_validation.ragas.NoiseSensitivity',\n", + " 'validmind.model_validation.ragas.ResponseRelevancy',\n", + " 'validmind.model_validation.ragas.SemanticSimilarity',\n", + " 'validmind.prompt_validation.Bias',\n", + " 'validmind.prompt_validation.Clarity',\n", + " 'validmind.prompt_validation.Conciseness',\n", + " 'validmind.prompt_validation.Delimitation',\n", + " 'validmind.prompt_validation.NegativeInstruction',\n", + " 'validmind.prompt_validation.Robustness',\n", + " 'validmind.prompt_validation.Specificity']" ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.20" - } + ], + "source": [ + "text_summarization_tests = list_tests(task=\"text_summarization\", pretty=False)\n", + "text_summarization_tests" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Next steps\n", + "\n", + "Now that you know how to browse and filter tests in the ValidMind Library, you’re ready to take the next step. Use the test IDs you’ve selected to either run individual tests or batch run them with custom test suites.\n", + "\n", + "
Learn about the test suites available in the ValidMind Library.\n", + "

\n", + "Check out our Explore test suites notebook for more code examples and usage of key functions.
\n", + "\n", + "\n", + "\n", + "### Discover more learning resources\n", + "\n", + "We offer many interactive notebooks to help you automate testing, documenting, validating, and more:\n", + "\n", + "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", + "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", + "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", + "\n", + "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Upgrade ValidMind\n", + "\n", + "
After installing ValidMind, you'll want to periodically make sure you are on the latest version to access any new features and other enhancements.
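If you'd prefer to check your installed version programmatically — a minimal sketch using only the Python standard library (`importlib.metadata` ships with Python 3.8+):

```python
# Minimal sketch: report the installed ValidMind version, if any
from importlib.metadata import PackageNotFoundError, version

try:
    print(f"validmind {version('validmind')}")
except PackageNotFoundError:
    print("validmind is not installed in this environment")
```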
\n", + "\n", + "Retrieve the information for the currently installed version of ValidMind:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip show validmind" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", + "\n", + "```bash\n", + "%pip install --upgrade validmind\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You may need to restart your kernel after running the upgrade package for changes to be applied." + ] + }, + { + "cell_type": "markdown", + "id": "copyright-fb6994d364c54669b356f7a2278d6480", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "***\n", + "\n", + "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", + "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", + "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 4 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.20" + } + }, + "nbformat": 4, + "nbformat_minor": 4 } diff --git a/notebooks/how_to/tests/run_tests/configure_tests/enable_pii_detection.ipynb b/notebooks/how_to/tests/run_tests/configure_tests/enable_pii_detection.ipynb index c33c655f4..cbb8af41d 100644 --- a/notebooks/how_to/tests/run_tests/configure_tests/enable_pii_detection.ipynb +++ b/notebooks/how_to/tests/run_tests/configure_tests/enable_pii_detection.ipynb @@ -1,677 +1,677 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "fafe2741", - "metadata": {}, - "source": [ - "# Enable PII detection in tests" - ] - }, - { - "cell_type": "markdown", - "id": "75cb4b61", - "metadata": {}, - "source": [ - "Learn how to enable and configure Personally Identifiable Information (PII) detection when running tests with the ValidMind Library. Choose whether or not to include PII in test descriptions generated, or whether or not to include PII in test results logged to the ValidMind Platform." - ] - }, - { - "cell_type": "markdown", - "id": "e4ebad56", - "metadata": {}, - "source": [ - "::: {.content-hidden when-format=\"html\"}\n", - "## Contents \n", - "- [About ValidMind](#toc1__) \n", - " - [Before you begin](#toc1_1__) \n", - " - [New to ValidMind?](#toc1_2__) \n", - " - [Key concepts](#toc1_3__) \n", - "- [Setting up](#toc2__) \n", - " - [Install the ValidMind Library with PII detection](#toc2_1__) \n", - " - [Initialize the ValidMind Library](#toc2_2__) \n", - " - [Get your code snippet](#toc2_2_1__) \n", - "- [Using PII detection](#toc3__) \n", - " - [Create a custom test that outputs PII](#toc3_1__) \n", - " - [Run test under different PII detection modes](#toc3_2__) \n", - " - [disabled](#toc3_2_1__) \n", - " - [test_results](#toc3_2_2__) \n", - " - [test_descriptions](#toc3_2_3__) \n", - " - [all](#toc3_2_4__) \n", - " - [Override detection](#toc3_3__) \n", - " - [Override test result logging](#toc3_3_1__) \n", - " - [Override test descriptions and test result logging](#toc3_3_2__) \n", - " - [Review logged test results](#toc3_4__) \n", - "- [Troubleshooting](#toc4__) \n", - "- [Learn more](#toc5__) \n", - "- [Upgrade ValidMind](#toc6__) \n", - "\n", - ":::\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "a2f801a9", - "metadata": {}, - "source": [ - "\n", - "\n", - "## About ValidMind\n", - "\n", - "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models. \n", - "\n", - "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators." - ] - }, - { - "cell_type": "markdown", - "id": "e920bce6", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Before you begin\n", - "\n", - "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. 
If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. \n", - "\n", - "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html)." - ] - }, - { - "cell_type": "markdown", - "id": "3a3fb4fc", - "metadata": {}, - "source": [ - "\n", - "\n", - "### New to ValidMind?\n", - "\n", - "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", - "\n", - "
For access to all features available in this notebook, you'll need access to a ValidMind account.\n", - "

\n", - "Register with ValidMind
" - ] - }, - { - "cell_type": "markdown", - "id": "9a49b776", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Key concepts\n", - "\n", - "**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.\n", - "\n", - "**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.\n", - "\n", - "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", - "\n", - "**Metrics**: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.\n", - "\n", - "**Custom metrics**: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with the ValidMind Library to be used in the ValidMind Platform.\n", - "\n", - "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", - "\n", - " - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", - " - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", - " - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.\n", - " - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. (Learn more: [Run tests with multiple datasets](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html))\n", - "\n", - "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.\n", - "\n", - "**Outputs**: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. 
Plots may be matplotlib or plotly figures.\n", - "\n", - "**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.\n", - "\n", - "Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases." - ] - }, - { - "cell_type": "markdown", - "id": "41aee68d", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Setting up" - ] - }, - { - "cell_type": "markdown", - "id": "ba30e377", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Install the ValidMind Library with PII detection\n", - "\n", - "
Recommended Python versions\n", - "

\n", - "Python 3.8 <= x <= 3.11
\n", - "\n", - "To use PII detection powered by [Microsoft Presidio](https://microsoft.github.io/presidio/), install the library with the explicit `[pii-detection]` extra specifier:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b830ae91", - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q \"validmind[pii-detection]\"" - ] - }, - { - "cell_type": "markdown", - "id": "4b44677b", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Initialize the ValidMind Library\n", - "\n", - "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook." - ] - }, - { - "cell_type": "markdown", - "id": "84464a2b", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Get your code snippet\n", - "\n", - "1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n", - "\n", - "2. In the left sidebar, navigate to **Inventory** and click **+ Register Model**.\n", - "\n", - "3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n", - "\n", - "4. Go to **Getting Started** and click **Copy snippet to clipboard**.\n", - "\n", - "Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "eeda4c8c", - "metadata": {}, - "outputs": [], - "source": [ - "# Load your model identifier credentials from an `.env` file\n", - "\n", - "%load_ext dotenv\n", - "%dotenv .env\n", - "\n", - "# Or replace with your code snippet\n", - "\n", - "import validmind as vm\n", - "\n", - "vm.init(\n", - " # api_host=\"...\",\n", - " # api_key=\"...\",\n", - " # api_secret=\"...\",\n", - " # model=\"...\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "62f24552", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Using PII detection" - ] - }, - { - "cell_type": "markdown", - "id": "fd9b6e44", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Create a custom test that outputs PII\n", - "\n", - "To demonstrate the feature, we'll need a test that outputs PII. First we'll create a custom test that returns:\n", - "\n", - "- A description string containing PII (name, email, phone)\n", - "- A small table containing PII in columns\n", - "\n", - "This output mirrors the structure used in other custom test notebooks and will exercise both table and description PII detection paths. However, if structured detection is unavailable, the library falls back to token-level text scans when possible." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "04d8c802", - "metadata": {}, - "outputs": [], - "source": [ - "import pandas as pd\n", - "\n", - "from validmind import test\n", - "\n", - "@test(\"pii_demo.PIIDetection\")\n", - "def pii_custom_test():\n", - " \"\"\"A custom test that returns demo PII.\n", - " This default test description will display when PII is not sent to the LLM to generate test descriptions based on test result data.\"\"\"\n", - " return pd.DataFrame(\n", - " {\n", - " \"name\": [\"Jane Smith\", \"John Doe\", \"Alice Johnson\"],\n", - " \"email\": [\n", - " \"jane.smith@bank.example\",\n", - " \"john.doe@company.example\",\n", - " \"alice.johnson@service.example\",\n", - " ],\n", - " \"phone\": [\"(212) 555-9876\", \"(415) 555-1234\", \"(646) 555-5678\"],\n", - " }\n", - " )" - ] - }, - { - "cell_type": "markdown", - "id": "53e02410", - "metadata": {}, - "source": [ - "
Want to learn more about custom tests?\n", - "

\n", - "Check out our extended introduction to custom tests — Implement custom tests
" - ] - }, - { - "cell_type": "markdown", - "id": "c4065f2a", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Run test under different PII detection modes\n", - "\n", - "Next, let's import [the `run_test` function](https://docs.validmind.ai/validmind/validmind/tests.html#run_test) provided by the `validmind.tests` module to run our custom test via a function called `run_pii_test()` that catches exceptions to observe blocking behavior when PII is present:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b42288e5", - "metadata": {}, - "outputs": [], - "source": [ - "import os\n", - "from validmind.tests import run_test\n", - "\n", - "# Run test and tag result with unique `result_id`\n", - "def run_pii_test(result_id=\"\"):\n", - " try:\n", - " test_name = f\"pii_demo.PIIDetection:{result_id}\"\n", - " result = run_test(test_name)\n", - "\n", - " # Check if the test description was generated by LLM\n", - " if not result._was_description_generated:\n", - " print(\"PII detected: LLM-generated test description skipped\")\n", - " else:\n", - " print(\"No PII detected or detection disabled: Test description generated by LLM\")\n", - "\n", - " # Try logging test results to the ValidMind Platform\n", - " result.log()\n", - " print(\"No PII detected or detection disabled: Test results logged to the ValidMind Platform\")\n", - " except Exception as e:\n", - " print(\"PII detected: Test results not logged to the ValidMind Platform\")" - ] - }, - { - "cell_type": "markdown", - "id": "867dbd94", - "metadata": {}, - "source": [ - "We'll then switch the `VALIDMIND_PII_DETECTION` environment variable across modes in the below examples.\n", - "\n", - "
Note that since we are running a custom test that does not exist in your model's default documentation template, we'll receive output indicating that a test-driven block doesn't currently exist in your model's documentation for that particular test ID.\n", - "

\n", - "That's expected, as when we run custom tests the results logged need to be manually added to your documentation within the ValidMind Platform or added to your documentation template.
" - ] - }, - { - "cell_type": "markdown", - "id": "0e151763", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### disabled\n", - "\n", - "When detection is set to `disabled`, tests run and generate test descriptions. Logging tests with [`.log()`](https://docs.validmind.ai/validmind/validmind/vm_models.html#TestResult.log) will also send test descriptions and test results to the ValidMind Platform as usual:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "3078af64", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"\\n=== Mode: disabled ===\")\n", - "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"disabled\"\n", - "\n", - "# Run test and tag result with unique ID `disabled`\n", - "run_pii_test(\"disabled\")" - ] - }, - { - "cell_type": "markdown", - "id": "c797d2e3", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### test_results\n", - "\n", - "When detection is set for `test_results`, tests run and generate test descriptions for review in your environment, but logging tests will not send descriptions or test results to the ValidMind Platform:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "12e61a80", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"\\n=== Mode: test_results ===\")\n", - "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"test_results\"\n", - "\n", - "# Run test and tag result with unique ID `results_blocked`\n", - "run_pii_test(\"results_blocked\")" - ] - }, - { - "cell_type": "markdown", - "id": "9d5cb41c", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### test_descriptions\n", - "\n", - "When detection is set for `test_descriptions`, tests run but will not generate test descriptions, and logging tests will not send descriptions but will send test results to the ValidMind Platform:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "feba6207", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"\\n=== Mode: test_descriptions ===\")\n", - "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"test_descriptions\"\n", - "\n", - "# Run test and tag result with unique ID `desc_blocked`\n", - "run_pii_test(\"desc_blocked\")" - ] - }, - { - "cell_type": "markdown", - "id": "1d3d7256", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### all\n", - "\n", - "When detection is set to `all`, tests run will not generate test descriptions or log test results to the ValidMind Platform." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "af5040b5", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"\\n=== Mode: all ===\")\n", - "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"all\"\n", - "\n", - "# Run test and tag result with unique ID `all_blocked`\n", - "run_pii_test(\"all_blocked\")" - ] - }, - { - "cell_type": "markdown", - "id": "b1a5fd8e", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Override detection\n", - "\n", - "You can override blocking by passing `unsafe=True` to `result.log(unsafe=True)`, but this is not recommended outside controlled workflows.\n", - "\n", - "To demonstrate, let's rerun our custom test with some override scenarios." 
- ] - }, - { - "cell_type": "markdown", - "id": "8a378b22", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Override test result logging\n", - "\n", - "First, let's rerun our custom test with detection set to `all`, which will send the test results but not the test descriptions to the ValidMind Platform:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0387be21", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"\\n=== Mode: all & unsafe=True ===\")\n", - "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"all\"\n", - "\n", - "# Run test and tag result with unique ID `override_results`\n", - "try:\n", - " result = run_test(\"pii_demo.PIIDetection:override_results\")\n", - "\n", - " # Check if the test description was generated by LLM\n", - " if not result._was_description_generated:\n", - " print(\"PII detected: LLM-generated test description skipped\")\n", - " else:\n", - " print(\"No PII detected or detection disabled: Test description generated by LLM\")\n", - "\n", - " # Try logging test results to the ValidMind Platform\n", - " result.log(unsafe=True)\n", - " print(\"No PII detected, detection disabled, or override set: Test results logged to the ValidMind Platform\")\n", - "except Exception as e:\n", - " print(\"PII detected: Test results not logged to the ValidMind Platform\")" - ] - }, - { - "cell_type": "markdown", - "id": "8197c39c", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Override test descriptions and test result logging\n", - "\n", - "To send both the test descriptions and test results via override, set the `VALIDMIND_PII_DETECTION` environment variable to `test_results` while including the override flag:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b40a2670", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"\\n=== Mode: test_results & unsafe=True ===\")\n", - "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"test_results\"\n", - "\n", - "# Run test and tag result with unique ID `override_both`\n", - "try:\n", - " result = run_test(\"pii_demo.PIIDetection:override_both\")\n", - "\n", - " # Check if the test description was generated by LLM\n", - " if not result._was_description_generated:\n", - " print(\"PII detected: LLM-generated test description skipped\")\n", - " else:\n", - " print(\"No PII detected, detection disabled, or override set: Test description generated by LLM\")\n", - "\n", - " # Try logging test results to the ValidMind Platform\n", - " result.log(unsafe=True)\n", - " print(\"No PII detected, detection disabled, or override set: Test results logged to the ValidMind Platform\")\n", - "except Exception as e:\n", - " print(\"PII detected: Test results not logged to the ValidMind Platform\")" - ] - }, - { - "cell_type": "markdown", - "id": "f2ce4348", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Review logged test results\n", - "\n", - "Now let's take a look at the results that were logged to the ValidMind Platform:\n", - "\n", - "1. From the **Inventory** in the ValidMind Platform, go to the model you registered earlier.\n", - "\n", - "2. In the left sidebar that appears for your model, click **Documentation** under Documents.\n", - "\n", - "3. Click on any section heading to expand that section to add a new test-driven block ([Need more help?](https://docs.validmind.ai/developer/model-documentation/work-with-test-results.html)).\n", - "\n", - "4. Under TEST-DRIVEN in the sidebar, click **Custom**.\n", - "\n", - "5. 
Confirm that you're able to insert the following logged results:\n", - "\n", - " - `pii_demo.PIIDetection:disabled`\n", - " - `pii_demo.PIIDetection:desc_blocked`\n", - " - `pii_demo.PIIDetection:override_results`\n", - " - `pii_demo.PIIDetection:override_both`" - ] - }, - { - "cell_type": "markdown", - "id": "d034b04c", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Troubleshooting\n", - "\n", - "- [x] If you see warnings that Presidio or Presidio analyzer is unavailable, ensure you installed extras: `validmind[pii-detection]`.\n", - "- [x] Ensure your environment is restarted after installing new packages if imports fail." - ] - }, - { - "cell_type": "markdown", - "id": "1da184e0", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Learn more\n", - "\n", - "We offer many interactive notebooks to help you automate testing, documenting, validating, and more:\n", - "\n", - "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", - "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", - "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", - "\n", - "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." - ] - }, - { - "cell_type": "markdown", - "id": "bcaf7fd4", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Upgrade ValidMind\n", - "\n", - "
After installing ValidMind, you'll want to periodically make sure you are on the latest version to access any new features and other enhancements.
\n", - "\n", - "Retrieve the information for the currently installed version of ValidMind:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dffb39a5", - "metadata": {}, - "outputs": [], - "source": [ - "%pip show validmind" - ] - }, - { - "cell_type": "markdown", - "id": "9e9f387d", - "metadata": {}, - "source": [ - "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", - "\n", - "```bash\n", - "%pip install --upgrade validmind\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "faf6cb0d", - "metadata": {}, - "source": [ - "You may need to restart your kernel after running the upgrade package for changes to be applied." - ] - }, - { - "cell_type": "markdown", - "id": "1931cbc1", - "metadata": {}, - "source": [ - "\n", - "\n", - "\n", - "\n", - "***\n", - "\n", - "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", - "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", - "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.10" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "fafe2741", + "metadata": {}, + "source": [ + "# Enable PII detection in tests" + ] + }, + { + "cell_type": "markdown", + "id": "75cb4b61", + "metadata": {}, + "source": [ + "Learn how to enable and configure Personally Identifiable Information (PII) detection when running tests with the ValidMind Library. Choose whether or not to include PII in test descriptions generated, or whether or not to include PII in test results logged to the ValidMind Platform." + ] + }, + { + "cell_type": "markdown", + "id": "e4ebad56", + "metadata": {}, + "source": [ + "::: {.content-hidden when-format=\"html\"}\n", + "## Contents \n", + "- [About ValidMind](#toc1__) \n", + " - [Before you begin](#toc1_1__) \n", + " - [New to ValidMind?](#toc1_2__) \n", + " - [Key concepts](#toc1_3__) \n", + "- [Setting up](#toc2__) \n", + " - [Install the ValidMind Library with PII detection](#toc2_1__) \n", + " - [Initialize the ValidMind Library](#toc2_2__) \n", + " - [Get your code snippet](#toc2_2_1__) \n", + "- [Using PII detection](#toc3__) \n", + " - [Create a custom test that outputs PII](#toc3_1__) \n", + " - [Run test under different PII detection modes](#toc3_2__) \n", + " - [disabled](#toc3_2_1__) \n", + " - [test_results](#toc3_2_2__) \n", + " - [test_descriptions](#toc3_2_3__) \n", + " - [all](#toc3_2_4__) \n", + " - [Override detection](#toc3_3__) \n", + " - [Override test result logging](#toc3_3_1__) \n", + " - [Override test descriptions and test result logging](#toc3_3_2__) \n", + " - [Review logged test results](#toc3_4__) \n", + "- [Troubleshooting](#toc4__) \n", + "- [Learn more](#toc5__) \n", + "- [Upgrade ValidMind](#toc6__) \n", + "\n", + ":::\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "a2f801a9", + "metadata": {}, + "source": [ + "\n", + "\n", + "## About ValidMind\n", + "\n", + "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models. \n", + "\n", + "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators." + ] + }, + { + "cell_type": "markdown", + "id": "e920bce6", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Before you begin\n", + "\n", + "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. \n", + "\n", + "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html)." 
+ ] + }, + { + "cell_type": "markdown", + "id": "3a3fb4fc", + "metadata": {}, + "source": [ + "\n", + "\n", + "### New to ValidMind?\n", + "\n", + "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", + "\n", + "
For access to all features available in this notebook, you'll need access to a ValidMind account.\n", + "

\n", + "Register with ValidMind
" + ] + }, + { + "cell_type": "markdown", + "id": "9a49b776", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Key concepts\n", + "\n", + "**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.\n", + "\n", + "**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.\n", + "\n", + "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", + "\n", + "**Metrics**: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.\n", + "\n", + "**Custom metrics**: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with the ValidMind Library to be used in the ValidMind Platform.\n", + "\n", + "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", + "\n", + " - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", + " - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", + " - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.\n", + " - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. (Learn more: [Run tests with multiple datasets](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html))\n", + "\n", + "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.\n", + "\n", + "**Outputs**: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. 
Plots may be matplotlib or plotly figures.\n", + "\n", + "**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.\n", + "\n", + "Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases." + ] + }, + { + "cell_type": "markdown", + "id": "41aee68d", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Setting up" + ] + }, + { + "cell_type": "markdown", + "id": "ba30e377", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Install the ValidMind Library with PII detection\n", + "\n", + "
Recommended Python versions\n", + "

\n", + "Python 3.8 <= x <= 3.11
\n", + "\n", + "To use PII detection powered by [Microsoft Presidio](https://microsoft.github.io/presidio/), install the library with the explicit `[pii-detection]` extra specifier:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b830ae91", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q \"validmind[pii-detection]\"" + ] + }, + { + "cell_type": "markdown", + "id": "4b44677b", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Initialize the ValidMind Library\n", + "\n", + "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook." + ] + }, + { + "cell_type": "markdown", + "id": "84464a2b", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Get your code snippet\n", + "\n", + "1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n", + "\n", + "2. In the left sidebar, navigate to **Inventory** and click **+ Register Model**.\n", + "\n", + "3. Enter the model details and click **Continue**. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n", + "\n", + "4. Go to **Getting Started** and click **Copy snippet to clipboard**.\n", + "\n", + "Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eeda4c8c", + "metadata": {}, + "outputs": [], + "source": [ + "# Load your model identifier credentials from an `.env` file\n", + "\n", + "%load_ext dotenv\n", + "%dotenv .env\n", + "\n", + "# Or replace with your code snippet\n", + "\n", + "import validmind as vm\n", + "\n", + "vm.init(\n", + " # api_host=\"...\",\n", + " # api_key=\"...\",\n", + " # api_secret=\"...\",\n", + " # model=\"...\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "62f24552", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Using PII detection" + ] + }, + { + "cell_type": "markdown", + "id": "fd9b6e44", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Create a custom test that outputs PII\n", + "\n", + "To demonstrate the feature, we'll need a test that outputs PII. First we'll create a custom test that returns:\n", + "\n", + "- A description string containing PII (name, email, phone)\n", + "- A small table containing PII in columns\n", + "\n", + "This output mirrors the structure used in other custom test notebooks and will exercise both table and description PII detection paths. However, if structured detection is unavailable, the library falls back to token-level text scans when possible." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04d8c802", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "from validmind import test\n", + "\n", + "@test(\"pii_demo.PIIDetection\")\n", + "def pii_custom_test():\n", + " \"\"\"A custom test that returns demo PII.\n", + " This default test description will display when PII is not sent to the LLM to generate test descriptions based on test result data.\"\"\"\n", + " return pd.DataFrame(\n", + " {\n", + " \"name\": [\"Jane Smith\", \"John Doe\", \"Alice Johnson\"],\n", + " \"email\": [\n", + " \"jane.smith@bank.example\",\n", + " \"john.doe@company.example\",\n", + " \"alice.johnson@service.example\",\n", + " ],\n", + " \"phone\": [\"(212) 555-9876\", \"(415) 555-1234\", \"(646) 555-5678\"],\n", + " }\n", + " )" + ] + }, + { + "cell_type": "markdown", + "id": "53e02410", + "metadata": {}, + "source": [ + "
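As an optional, standalone sanity check — separate from the ValidMind API — you can scan one of the demo strings directly with Presidio to see which entities it flags. This sketch assumes the `[pii-detection]` extra installed `presidio-analyzer` along with its default spaCy NLP model:

```python
# Optional sanity check: scan a demo string directly with Presidio.
# Assumes presidio-analyzer and its default spaCy model are installed
# via the [pii-detection] extra.
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
findings = analyzer.analyze(
    text="Jane Smith, jane.smith@bank.example, (212) 555-9876",
    language="en",
)
for finding in findings:
    print(finding.entity_type, round(finding.score, 2))
```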
Want to learn more about custom tests?\n", + "

\n", + "Check out our extended introduction to custom tests — Implement custom tests
" + ] + }, + { + "cell_type": "markdown", + "id": "c4065f2a", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Run test under different PII detection modes\n", + "\n", + "Next, let's import [the `run_test` function](https://docs.validmind.ai/validmind/validmind/tests.html#run_test) provided by the `validmind.tests` module to run our custom test via a function called `run_pii_test()` that catches exceptions to observe blocking behavior when PII is present:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b42288e5", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from validmind.tests import run_test\n", + "\n", + "# Run test and tag result with unique `result_id`\n", + "def run_pii_test(result_id=\"\"):\n", + " try:\n", + " test_name = f\"pii_demo.PIIDetection:{result_id}\"\n", + " result = run_test(test_name)\n", + "\n", + " # Check if the test description was generated by LLM\n", + " if not result._was_description_generated:\n", + " print(\"PII detected: LLM-generated test description skipped\")\n", + " else:\n", + " print(\"No PII detected or detection disabled: Test description generated by LLM\")\n", + "\n", + " # Try logging test results to the ValidMind Platform\n", + " result.log()\n", + " print(\"No PII detected or detection disabled: Test results logged to the ValidMind Platform\")\n", + " except Exception as e:\n", + " print(\"PII detected: Test results not logged to the ValidMind Platform\")" + ] + }, + { + "cell_type": "markdown", + "id": "867dbd94", + "metadata": {}, + "source": [ + "We'll then switch the `VALIDMIND_PII_DETECTION` environment variable across modes in the below examples.\n", + "\n", + "
Note that because this custom test isn't part of your model's default documentation template, the output will indicate that a test-driven block doesn't currently exist in your model's documentation for that test ID.\n", + "

\n", + "That's expected, as when we run custom tests the results logged need to be manually added to your documentation within the ValidMind Platform or added to your documentation template.
" + ] + }, + { + "cell_type": "markdown", + "id": "0e151763", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### disabled\n", + "\n", + "When detection is set to `disabled`, tests run and generate test descriptions. Logging tests with [`.log()`](https://docs.validmind.ai/validmind/validmind/vm_models.html#TestResult.log) will also send test descriptions and test results to the ValidMind Platform as usual:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3078af64", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n=== Mode: disabled ===\")\n", + "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"disabled\"\n", + "\n", + "# Run test and tag result with unique ID `disabled`\n", + "run_pii_test(\"disabled\")" + ] + }, + { + "cell_type": "markdown", + "id": "c797d2e3", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### test_results\n", + "\n", + "When detection is set for `test_results`, tests run and generate test descriptions for review in your environment, but logging tests will not send descriptions or test results to the ValidMind Platform:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12e61a80", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n=== Mode: test_results ===\")\n", + "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"test_results\"\n", + "\n", + "# Run test and tag result with unique ID `results_blocked`\n", + "run_pii_test(\"results_blocked\")" + ] + }, + { + "cell_type": "markdown", + "id": "9d5cb41c", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### test_descriptions\n", + "\n", + "When detection is set for `test_descriptions`, tests run but will not generate test descriptions, and logging tests will not send descriptions but will send test results to the ValidMind Platform:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "feba6207", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n=== Mode: test_descriptions ===\")\n", + "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"test_descriptions\"\n", + "\n", + "# Run test and tag result with unique ID `desc_blocked`\n", + "run_pii_test(\"desc_blocked\")" + ] + }, + { + "cell_type": "markdown", + "id": "1d3d7256", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### all\n", + "\n", + "When detection is set to `all`, tests run will not generate test descriptions or log test results to the ValidMind Platform." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "af5040b5", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n=== Mode: all ===\")\n", + "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"all\"\n", + "\n", + "# Run test and tag result with unique ID `all_blocked`\n", + "run_pii_test(\"all_blocked\")" + ] + }, + { + "cell_type": "markdown", + "id": "b1a5fd8e", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Override detection\n", + "\n", + "You can override blocking by passing `unsafe=True` to `result.log(unsafe=True)`, but this is not recommended outside controlled workflows.\n", + "\n", + "To demonstrate, let's rerun our custom test with some override scenarios." 
+ ] + }, + { + "cell_type": "markdown", + "id": "8a378b22", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Override test result logging\n", + "\n", + "First, let's rerun our custom test with detection set to `all`, which will send the test results but not the test descriptions to the ValidMind Platform:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0387be21", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n=== Mode: all & unsafe=True ===\")\n", + "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"all\"\n", + "\n", + "# Run test and tag result with unique ID `override_results`\n", + "try:\n", + " result = run_test(\"pii_demo.PIIDetection:override_results\")\n", + "\n", + " # Check if the test description was generated by LLM\n", + " if not result._was_description_generated:\n", + " print(\"PII detected: LLM-generated test description skipped\")\n", + " else:\n", + " print(\"No PII detected or detection disabled: Test description generated by LLM\")\n", + "\n", + " # Try logging test results to the ValidMind Platform\n", + " result.log(unsafe=True)\n", + " print(\"No PII detected, detection disabled, or override set: Test results logged to the ValidMind Platform\")\n", + "except Exception as e:\n", + " print(\"PII detected: Test results not logged to the ValidMind Platform\")" + ] + }, + { + "cell_type": "markdown", + "id": "8197c39c", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Override test descriptions and test result logging\n", + "\n", + "To send both the test descriptions and test results via override, set the `VALIDMIND_PII_DETECTION` environment variable to `test_results` while including the override flag:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b40a2670", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"\\n=== Mode: test_results & unsafe=True ===\")\n", + "os.environ[\"VALIDMIND_PII_DETECTION\"] = \"test_results\"\n", + "\n", + "# Run test and tag result with unique ID `override_both`\n", + "try:\n", + " result = run_test(\"pii_demo.PIIDetection:override_both\")\n", + "\n", + " # Check if the test description was generated by LLM\n", + " if not result._was_description_generated:\n", + " print(\"PII detected: LLM-generated test description skipped\")\n", + " else:\n", + " print(\"No PII detected, detection disabled, or override set: Test description generated by LLM\")\n", + "\n", + " # Try logging test results to the ValidMind Platform\n", + " result.log(unsafe=True)\n", + " print(\"No PII detected, detection disabled, or override set: Test results logged to the ValidMind Platform\")\n", + "except Exception as e:\n", + " print(\"PII detected: Test results not logged to the ValidMind Platform\")" + ] + }, + { + "cell_type": "markdown", + "id": "f2ce4348", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Review logged test results\n", + "\n", + "Now let's take a look at the results that were logged to the ValidMind Platform:\n", + "\n", + "1. From the **Inventory** in the ValidMind Platform, go to the model you registered earlier.\n", + "\n", + "2. In the left sidebar that appears for your model, click **Documentation** under Documents.\n", + "\n", + "3. Click on any section heading to expand that section to add a new test-driven block ([Need more help?](https://docs.validmind.ai/developer/model-documentation/work-with-test-results.html)).\n", + "\n", + "4. Under TEST-DRIVEN in the sidebar, click **Custom**.\n", + "\n", + "5. 
Confirm that you're able to insert the following logged results:\n", + "\n", + " - `pii_demo.PIIDetection:disabled`\n", + " - `pii_demo.PIIDetection:desc_blocked`\n", + " - `pii_demo.PIIDetection:override_results`\n", + " - `pii_demo.PIIDetection:override_both`" + ] + }, + { + "cell_type": "markdown", + "id": "d034b04c", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Troubleshooting\n", + "\n", + "- [x] If you see warnings that Presidio or Presidio analyzer is unavailable, ensure you installed extras: `validmind[pii-detection]`.\n", + "- [x] Ensure your environment is restarted after installing new packages if imports fail." + ] + }, + { + "cell_type": "markdown", + "id": "1da184e0", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Learn more\n", + "\n", + "We offer many interactive notebooks to help you automate testing, documenting, validating, and more:\n", + "\n", + "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", + "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", + "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", + "\n", + "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." + ] + }, + { + "cell_type": "markdown", + "id": "bcaf7fd4", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Upgrade ValidMind\n", + "\n", + "
After installing ValidMind, you'll want to periodically make sure you are on the latest version to access any new features and other enhancements.
\n", + "\n", + "Retrieve the information for the currently installed version of ValidMind:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dffb39a5", + "metadata": {}, + "outputs": [], + "source": [ + "%pip show validmind" + ] + }, + { + "cell_type": "markdown", + "id": "9e9f387d", + "metadata": {}, + "source": [ + "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", + "\n", + "```bash\n", + "%pip install --upgrade validmind\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "faf6cb0d", + "metadata": {}, + "source": [ + "You may need to restart your kernel after running the upgrade package for changes to be applied." + ] + }, + { + "cell_type": "markdown", + "id": "copyright-096666bfeef04fd7802d45e5dc221ca2", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "***\n", + "\n", + "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", + "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", + "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/notebooks/tutorials/model_development/4-finalize_testing_documentation.ipynb b/notebooks/tutorials/model_development/4-finalize_testing_documentation.ipynb index 5af7fb987..07643d1cb 100644 --- a/notebooks/tutorials/model_development/4-finalize_testing_documentation.ipynb +++ b/notebooks/tutorials/model_development/4-finalize_testing_documentation.ipynb @@ -960,6 +960,7 @@ }, { "cell_type": "markdown", + "id": "copyright-75cfc55507924d27b0d37b140c473293", "metadata": {}, "source": [ "\n", diff --git a/notebooks/tutorials/model_validation/4-finalize_validation_reporting.ipynb b/notebooks/tutorials/model_validation/4-finalize_validation_reporting.ipynb index 59a5b3ebe..88f0ea845 100644 --- a/notebooks/tutorials/model_validation/4-finalize_validation_reporting.ipynb +++ b/notebooks/tutorials/model_validation/4-finalize_validation_reporting.ipynb @@ -1205,6 +1205,7 @@ }, { "cell_type": "markdown", + "id": "copyright-b55920b2495443d1894125f60e582bb4", "metadata": {}, "source": [ "\n", diff --git a/notebooks/use_cases/agents/banking_tools.py b/notebooks/use_cases/agents/banking_tools.py index a3f08c6eb..298bdadc6 100644 --- a/notebooks/use_cases/agents/banking_tools.py +++ b/notebooks/use_cases/agents/banking_tools.py @@ -1,6 +1,7 @@ from typing import Optional from datetime import datetime from langchain.tools import tool +from deepeval.tracing import observe def _score_dti_ratio(dti_ratio: float) -> int: @@ -79,6 +80,7 @@ def _get_credit_description(credit_score: int) -> str: # Credit Risk Analyzer Tool @tool +@observe(type="tool") def credit_risk_analyzer( customer_income: float, customer_debt: float, @@ -279,8 +281,8 @@ def _handle_recommend_product(customer): def _handle_get_info(customer, customer_id): """Handle get info action.""" - credit_tier = ('Excellent' if customer['credit_score'] >= 750 else - 'Good' if customer['credit_score'] >= 700 else + credit_tier = ('Excellent' if customer['credit_score'] >= 750 else + 'Good' if customer['credit_score'] >= 700 else 'Fair' if customer['credit_score'] >= 650 else 'Poor') return f"""CUSTOMER ACCOUNT INFORMATION @@ -308,6 +310,7 @@ def _handle_get_info(customer, customer_id): # Customer Account Manager Tool @tool +@observe(type="tool") def customer_account_manager( account_type: str, customer_id: str, @@ -362,6 +365,7 @@ def customer_account_manager( # Fraud Detection System Tool @tool +@observe(type="tool") def fraud_detection_system( transaction_id: str, customer_id: str, diff --git a/notebooks/use_cases/agents/document_agentic_ai.ipynb b/notebooks/use_cases/agents/document_agentic_ai.ipynb index 27432b461..bc551fdf6 100644 --- a/notebooks/use_cases/agents/document_agentic_ai.ipynb +++ b/notebooks/use_cases/agents/document_agentic_ai.ipynb @@ -1,2198 +1,2178 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "e7277c38", - "metadata": {}, - "source": [ - "# Document an agentic AI system\n", - "\n", - "Build and document an agentic AI system with the ValidMind Library. 
Construct a LangGraph-based banking agent, assign AI evaluation metric scores to your agent, and run accuracy, RAGAS, and safety tests, then log those test results to the ValidMind Platform.\n", - "\n", - "An _AI agent_ is an autonomous system that interprets inputs, selects from available tools or actions, and executes multi-step behaviors to achieve defined goals. In this notebook, the agent acts as a banking assistant that analyzes user requests and automatically selects and invokes the appropriate specialized banking tool to deliver accurate, compliant, and actionable responses.\n", - "\n", - "- This agent enables financial institutions to automate complex banking workflows where different customer requests require different specialized tools and knowledge bases.\n", - "- Effective validation of agentic AI systems reduces the risks of agents misinterpreting inputs, failing to extract required parameters, or producing incorrect assessments or actions — such as selecting the wrong tool.\n", - "\n", - "
For the LLM components in this notebook to function properly, you'll need access to OpenAI.\n", - "

\n", - "Before you continue, ensure that a valid OPENAI_API_KEY is set in your .env file.
" - ] - }, - { - "cell_type": "markdown", - "id": "a47dd942", - "metadata": {}, - "source": [ - "::: {.content-hidden when-format=\"html\"}\n", - "## Contents \n", - "- [About ValidMind](#toc1__) \n", - " - [Before you begin](#toc1_1__) \n", - " - [New to ValidMind?](#toc1_2__) \n", - " - [Key concepts](#toc1_3__) \n", - "- [Setting up](#toc2__) \n", - " - [Install the ValidMind Library](#toc2_1__) \n", - " - [Initialize the ValidMind Library](#toc2_2__) \n", - " - [Register sample model](#toc2_2_1__) \n", - " - [Apply documentation template](#toc2_2_2__) \n", - " - [Get your code snippet](#toc2_2_3__) \n", - " - [Preview the documentation template](#toc2_2_4__) \n", - " - [Verify OpenAI API access](#toc2_3__) \n", - " - [Initialize the Python environment](#toc2_4__) \n", - "- [Building the LangGraph agent](#toc3__) \n", - " - [Test available banking tools](#toc3_1__) \n", - " - [Create LangGraph banking agent](#toc3_2__) \n", - " - [Define system prompt](#toc3_2_1__) \n", - " - [Initialize the LLM](#toc3_2_2__) \n", - " - [Define agent state structure](#toc3_2_3__) \n", - " - [Create agent workflow function](#toc3_2_4__) \n", - " - [Instantiate the banking agent](#toc3_2_5__) \n", - " - [Integrate agent with ValidMind](#toc3_3__) \n", - " - [Import ValidMind components](#toc3_3_1__) \n", - " - [Create agent wrapper function](#toc3_3_2__) \n", - " - [Initialize the ValidMind model object](#toc3_3_3__) \n", - " - [Store the agent reference](#toc3_3_4__) \n", - " - [Verify integration](#toc3_3_5__) \n", - " - [Validate the system prompt](#toc3_4__) \n", - "- [Initialize the ValidMind datasets](#toc4__) \n", - " - [Assign predictions](#toc4_1__) \n", - "- [Running accuracy tests](#toc5__) \n", - " - [Response accuracy test](#toc5_1__) \n", - " - [Tool selection accuracy test](#toc5_2__) \n", - "- [Assigning AI evaluation metric scores](#toc6__) \n", - " - [Identify relevant DeepEval scorers](#toc6_1__) \n", - " - [Assign reasoning scores](#toc6_2__) \n", - " - [Plan quality score](#toc6_2_1__) \n", - " - [Plan adherence score](#toc6_2_2__) \n", - " - [Assign action scores](#toc6_3__) \n", - " - [Tool correctness score](#toc6_3_1__) \n", - " - [Argument correctness score](#toc6_3_2__) \n", - " - [Assign execution scores](#toc6_4__) \n", - " - [Task completion score](#toc6_4_1__) \n", - "- [Running RAGAS tests](#toc7__) \n", - " - [Identify relevant RAGAS tests](#toc7_1__) \n", - " - [Faithfulness](#toc7_1_1__) \n", - " - [Response Relevancy](#toc7_1_2__) \n", - " - [Context Recall](#toc7_1_3__) \n", - "- [Running safety tests](#toc8__) \n", - " - [AspectCritic](#toc8_1_1__) \n", - " - [Bias](#toc8_1_2__) \n", - "- [Next steps](#toc9__) \n", - " - [Work with your model documentation](#toc9_1__) \n", - " - [Customize the banking agent for your use case](#toc9_2__) \n", - " - [Discover more learning resources](#toc9_3__) \n", - "- [Upgrade ValidMind](#toc10__) \n", - "\n", - ":::\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "id": "ecaad35f", - "metadata": {}, - "source": [ - "\n", - "\n", - "## About ValidMind\n", - "\n", - "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models. \n", - "\n", - "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. 
Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators." - ] - }, - { - "cell_type": "markdown", - "id": "6ff1f9ef", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Before you begin\n", - "\n", - "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. \n", - "\n", - "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html)." - ] - }, - { - "cell_type": "markdown", - "id": "d7ad8d8c", - "metadata": {}, - "source": [ - "\n", - "\n", - "### New to ValidMind?\n", - "\n", - "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", - "\n", - "
For access to all features available in this notebook, you'll need a ValidMind account.\n", - "

\n", - "Register with ValidMind
" - ] - }, - { - "cell_type": "markdown", - "id": "323caa59", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Key concepts\n", - "\n", - "**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.\n", - "\n", - "**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.\n", - "\n", - "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", - "\n", - "**Metrics**: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.\n", - "\n", - "**Custom metrics**: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with the ValidMind Library to be used in the ValidMind Platform.\n", - "\n", - "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", - "\n", - " - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", - " - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", - " - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.\n", - " - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. (Learn more: [Run tests with multiple datasets](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/run_tests_that_require_multiple_datasets.html))\n", - "\n", - "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.\n", - "\n", - "**Outputs**: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. 
Plots may be matplotlib or plotly figures.\n", - "\n", - "**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.\n", - "\n", - "Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases." - ] - }, - { - "cell_type": "markdown", - "id": "ddba5169", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Setting up" - ] - }, - { - "cell_type": "markdown", - "id": "b53da99c", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Install the ValidMind Library\n", - "\n", - "
Recommended Python versions\n", - "

\n", - "Python 3.8 <= x <= 3.11
\n", - "\n", - "Let's begin by installing the ValidMind Library with large language model (LLM) support:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1982a118", - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q \"validmind[llm]\" \"langgraph==0.3.21\"" - ] - }, - { - "cell_type": "markdown", - "id": "dc9dea3a", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Initialize the ValidMind Library" - ] - }, - { - "cell_type": "markdown", - "id": "5848461e", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Register sample model\n", - "\n", - "Let's first register a sample model for use with this notebook.\n", - "\n", - "1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n", - "\n", - "2. In the left sidebar, navigate to **Inventory** and click **+ Register Model**.\n", - "\n", - "3. Enter the model details and click **Next >** to continue to assignment of model stakeholders. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n", - "\n", - "4. Select your own name under the **MODEL OWNER** drop-down.\n", - "\n", - "5. Click **Register Model** to add the model to your inventory." - ] - }, - { - "cell_type": "markdown", - "id": "97d0b04b", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Apply documentation template\n", - "\n", - "Once you've registered your model, let's select a documentation template. A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.\n", - "\n", - "1. In the left sidebar that appears for your model, click **Documents** and select **Documentation**.\n", - "\n", - "2. Under **TEMPLATE**, select `Agentic AI`.\n", - "\n", - "3. Click **Use Template** to apply the template." - ] - }, - { - "cell_type": "markdown", - "id": "b279d5fa", - "metadata": {}, - "source": [ - "
Can't select this template?\n", - "

\n", - "Your organization administrators may need to add it to your template library:\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "id": "3606cb8c", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Get your code snippet\n", - "\n", - "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.\n", - "\n", - "1. On the left sidebar that appears for your model, select **Getting Started** and click **Copy snippet to clipboard**.\n", - "2. Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d6ccbefc", - "metadata": {}, - "outputs": [], - "source": [ - "# Load your model identifier credentials from an `.env` file\n", - "\n", - "%load_ext dotenv\n", - "%dotenv .env\n", - "\n", - "# Or replace with your code snippet\n", - "\n", - "import validmind as vm\n", - "\n", - "vm.init(\n", - " # api_host=\"...\",\n", - " # api_key=\"...\",\n", - " # api_secret=\"...\",\n", - " # model=\"...\",\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "2ed79cf0", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Preview the documentation template\n", - "\n", - "Let's verify that you have connected the ValidMind Library to the ValidMind Platform and that the appropriate *template* is selected for your model.\n", - "\n", - "You will upload documentation and test results unique to your model based on this template later on. For now, **take a look at the default structure that the template provides with [the `vm.preview_template()` function](https://docs.validmind.ai/validmind/validmind.html#preview_template)** from the ValidMind library and note the empty sections:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dffdaa6f", - "metadata": {}, - "outputs": [], - "source": [ - "vm.preview_template()" - ] - }, - { - "cell_type": "markdown", - "id": "b5c5ba68", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Verify OpenAI API access\n", - "\n", - "Verify that a valid `OPENAI_API_KEY` is set in your `.env` file:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "22cc39cb", - "metadata": {}, - "outputs": [], - "source": [ - "# Load environment variables if using .env file\n", - "try:\n", - " from dotenv import load_dotenv\n", - " load_dotenv()\n", - "except ImportError:\n", - " print(\"dotenv not installed. Make sure OPENAI_API_KEY is set in your environment.\")" - ] - }, - { - "cell_type": "markdown", - "id": "e4a9d3a9", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Initialize the Python environment\n", - "\n", - "Let's import all the necessary libraries to prepare for building our banking LangGraph agentic system:\n", - "\n", - "- **Standard libraries** for data handling and environment management.\n", - "- **pandas**, a Python library for data manipulation and analytics, as an alias. We'll also configure pandas to show all columns and all rows at full width for easier debugging and inspection.\n", - "- **LangChain** components for LLM integration and tool management.\n", - "- **LangGraph** for building stateful, multi-step agent workflows.\n", - "- **Banking tools** for specialized financial services as defined in [banking_tools.py](banking_tools.py)." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2058d1ac", - "metadata": {}, - "outputs": [], - "source": [ - "# STANDARD LIBRARY IMPORTS\n", - "\n", - "# TypedDict: Defines type-safe dictionaries for the agent's state structure\n", - "# Annotated: Adds metadata to type hints\n", - "# Sequence: Type hint for sequences used in the agent\n", - "from typing import TypedDict, Annotated, Sequence\n", - "\n", - "# THIRD PARTY IMPORTS\n", - "\n", - "import pandas as pd\n", - "# Configure pandas to show all columns and all rows at full width\n", - "pd.set_option('display.max_columns', None)\n", - "pd.set_option('display.max_colwidth', None)\n", - "pd.set_option('display.width', None)\n", - "pd.set_option('display.max_rows', None)\n", - "\n", - "# BaseMessage: Represents a base message in the LangChain message system\n", - "# HumanMessage: Represents a human message in the LangChain message system\n", - "# SystemMessage: Represents a system message in the LangChain message system\n", - "from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage\n", - "\n", - "# ChatOpenAI: Represents an OpenAI chat model in the LangChain library\n", - "from langchain_openai import ChatOpenAI\n", - "\n", - "# MemorySaver: Represents a checkpoint for saving and restoring agent state\n", - "from langgraph.checkpoint.memory import MemorySaver\n", - "\n", - "# StateGraph: Represents a stateful graph in the LangGraph library\n", - "# END: Represents the end of a graph\n", - "# START: Represents the start of a graph\n", - "from langgraph.graph import StateGraph, END, START\n", - "\n", - "# add_messages: Adds messages to the state\n", - "from langgraph.graph.message import add_messages\n", - "\n", - "# ToolNode: Represents a tool node in the LangGraph library\n", - "from langgraph.prebuilt import ToolNode\n", - "\n", - "# LOCAL IMPORTS FROM banking_tools.py\n", - "\n", - "from banking_tools import AVAILABLE_TOOLS" - ] - }, - { - "cell_type": "markdown", - "id": "e109d075", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Building the LangGraph agent" - ] - }, - { - "cell_type": "markdown", - "id": "15040411", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Test available banking tools\n", - "\n", - "We'll use the demo banking tools defined in `banking_tools.py` that provide use cases of financial services:\n", - "\n", - "- **Credit Risk Analyzer** - Loan applications and credit decisions\n", - "- **Customer Account Manager** - Account services and customer support\n", - "- **Fraud Detection System** - Security and fraud prevention" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1e0a120c", - "metadata": {}, - "outputs": [], - "source": [ - "print(f\"Available tools: {len(AVAILABLE_TOOLS)}\")\n", - "print(\"\\nTool Details:\")\n", - "for i, tool in enumerate(AVAILABLE_TOOLS, 1):\n", - " print(f\" - {tool.name}\")" - ] - }, - { - "cell_type": "markdown", - "id": "04d6785a", - "metadata": {}, - "source": [ - "Let's test each banking tool individually to ensure they're working correctly before integrating them into our agent:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dc0caff2", - "metadata": {}, - "outputs": [], - "source": [ - "# Test 1: Credit Risk Analyzer\n", - "print(\"TEST 1: Credit Risk Analyzer\")\n", - "print(\"-\" * 40)\n", - "try:\n", - " # Access the underlying function using .func\n", - " credit_result = AVAILABLE_TOOLS[0].func(\n", - " customer_income=75000,\n", - " customer_debt=1200,\n", - " 
credit_score=720,\n", - " loan_amount=50000,\n", - " loan_type=\"personal\"\n", - " )\n", - " print(credit_result)\n", - " print(\"Credit Risk Analyzer test PASSED\")\n", - "except Exception as e:\n", - " print(f\"Credit Risk Analyzer test FAILED: {e}\")\n", - "\n", - "print(\"\" + \"=\" * 60)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "b6b227db", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "# Test 2: Customer Account Manager\n", - "print(\"TEST 2: Customer Account Manager\")\n", - "print(\"-\" * 40)\n", - "try:\n", - " # Test checking balance\n", - " account_result = AVAILABLE_TOOLS[1].func(\n", - " account_type=\"checking\",\n", - " customer_id=\"12345\",\n", - " action=\"check_balance\"\n", - " )\n", - " print(account_result)\n", - "\n", - " # Test getting account info\n", - " info_result = AVAILABLE_TOOLS[1].func(\n", - " account_type=\"all\",\n", - " customer_id=\"12345\", \n", - " action=\"get_info\"\n", - " )\n", - " print(info_result)\n", - " print(\"Customer Account Manager test PASSED\")\n", - "except Exception as e:\n", - " print(f\"Customer Account Manager test FAILED: {e}\")\n", - "\n", - "print(\"\" + \"=\" * 60)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a983b30d", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "# Test 3: Fraud Detection System\n", - "print(\"TEST 3: Fraud Detection System\")\n", - "print(\"-\" * 40)\n", - "try:\n", - " fraud_result = AVAILABLE_TOOLS[2].func(\n", - " transaction_id=\"TX123\",\n", - " customer_id=\"12345\",\n", - " transaction_amount=500.00,\n", - " transaction_type=\"withdrawal\",\n", - " location=\"Miami, FL\",\n", - " device_id=\"DEVICE_001\"\n", - " )\n", - " print(fraud_result)\n", - " print(\"Fraud Detection System test PASSED\")\n", - "except Exception as e:\n", - " print(f\"Fraud Detection System test FAILED: {e}\")\n", - "\n", - "print(\"\" + \"=\" * 60)" - ] - }, - { - "cell_type": "markdown", - "id": "6bf04845", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Create LangGraph banking agent\n", - "\n", - "With our tools ready to go, we'll create our intelligent banking agent with LangGraph that automatically selects and uses the appropriate banking tool based on a user request." 
- ] - }, - { - "cell_type": "markdown", - "id": "31df57f0", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Define system prompt\n", - "\n", - "We'll begin by defining our system prompt, which provides the LLM with context about its role as a banking assistant and guidance on when to use each available tool:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7971c427", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "# Enhanced banking system prompt with tool selection guidance\n", - "system_context = \"\"\"You are a professional banking AI assistant with access to specialized banking tools.\n", - " Analyze the user's banking request and directly use the most appropriate tools to help them.\n", - " \n", - " AVAILABLE BANKING TOOLS:\n", - " \n", - " credit_risk_analyzer - Analyze credit risk for loan applications and credit decisions\n", - " - Use for: loan applications, credit assessments, risk analysis, mortgage eligibility\n", - " - Examples: \"Analyze credit risk for $50k personal loan\", \"Assess mortgage eligibility for $300k home purchase\"\n", - " - Parameters: customer_income, customer_debt, credit_score, loan_amount, loan_type\n", - "\n", - " customer_account_manager - Manage customer accounts and provide banking services\n", - " - Use for: account information, transaction processing, product recommendations, customer service\n", - " - Examples: \"Check balance for checking account 12345\", \"Recommend products for customer with high balance\"\n", - " - Parameters: account_type, customer_id, action, amount, account_details\n", - "\n", - " fraud_detection_system - Analyze transactions for potential fraud and security risks\n", - " - Use for: transaction monitoring, fraud prevention, risk assessment, security alerts\n", - " - Examples: \"Analyze fraud risk for $500 ATM withdrawal in Miami\", \"Check security for $2000 online purchase\"\n", - " - Parameters: transaction_id, customer_id, transaction_amount, transaction_type, location, device_id\n", - "\n", - " BANKING INSTRUCTIONS:\n", - " - Analyze the user's banking request carefully and identify the primary need\n", - " - If they need credit analysis → use credit_risk_analyzer\n", - " - If they need account services → use customer_account_manager\n", - " - If they need security analysis → use fraud_detection_system\n", - " - Extract relevant parameters from the user's request\n", - " - Provide helpful, accurate banking responses based on tool outputs\n", - " - Always consider banking regulations, risk management, and best practices\n", - " - Be professional and thorough in your analysis\n", - "\n", - " Choose and use tools wisely to provide the most helpful banking assistance.\n", - " Describe the response in a user-friendly manner with details describing the tool output.
\n", - " Provide the response in at least 500 words.\n", - " Generate a concise execution plan for the banking request.\n", - " \"\"\"" - ] - }, - { - "cell_type": "markdown", - "id": "406835c8", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Initialize the LLM\n", - "\n", - "Let's initialize the LLM that will power our banking agent:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "866066e7", - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize the main LLM for banking responses\n", - "main_llm = ChatOpenAI(\n", - " model=\"gpt-5-mini\",\n", - " reasoning={\n", - " \"effort\": \"low\",\n", - " \"summary\": \"auto\"\n", - " }\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "cce9685c", - "metadata": {}, - "source": [ - "Then bind the available banking tools to the LLM, enabling the model to automatically recognize and invoke each tool when appropriate based on request input and the system prompt we defined above:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "906d8132", - "metadata": {}, - "outputs": [], - "source": [ - "# Bind all banking tools to the main LLM\n", - "llm_with_tools = main_llm.bind_tools(AVAILABLE_TOOLS)" - ] - }, - { - "cell_type": "markdown", - "id": "2bad8799", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Define agent state structure\n", - "\n", - "The agent state defines the data structure that flows through the LangGraph workflow. It includes:\n", - "\n", - "- **messages** — The conversation history between the user and agent\n", - "- **user_input** — The current user request\n", - "- **session_id** — A unique identifier for the conversation session\n", - "- **context** — Additional context that can be passed between nodes\n", - "\n", - "Defining this state structure maintains the structure throughout the agent's execution and allows for multi-turn conversations with memory:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6b926ddf", - "metadata": {}, - "outputs": [], - "source": [ - "# Banking Agent State Definition\n", - "class BankingAgentState(TypedDict):\n", - " messages: Annotated[Sequence[BaseMessage], add_messages]\n", - " user_input: str\n", - " session_id: str\n", - " context: dict" - ] - }, - { - "cell_type": "markdown", - "id": "47ce81b7", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Create agent workflow function\n", - "\n", - "We'll build the LangGraph agent workflow with two main components:\n", - "\n", - "1. **LLM node** — Processes user requests, applies the system prompt, and decides whether to use tools.\n", - "2. **Tools node** — Executes the selected banking tools when the LLM determines they're needed.\n", - "\n", - "The workflow begins with the LLM analyzing the request, then uses tools if needed — or ends if the response is complete, and finally returns to the LLM to generate the final response." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2c9bf585", - "metadata": {}, - "outputs": [], - "source": [ - "def create_banking_langgraph_agent():\n", - " \"\"\"Create a comprehensive LangGraph banking agent with intelligent tool selection.\"\"\"\n", - " def llm_node(state: BankingAgentState) -> BankingAgentState:\n", - " \"\"\"Main LLM node that processes banking requests and selects appropriate tools.\"\"\"\n", - " messages = state[\"messages\"]\n", - " # Add system context to messages\n", - " enhanced_messages = [SystemMessage(content=system_context)] + list(messages)\n", - " # Get LLM response with tool selection\n", - " response = llm_with_tools.invoke(enhanced_messages)\n", - " return {\n", - " **state,\n", - " \"messages\": messages + [response]\n", - " }\n", - " \n", - " def should_continue(state: BankingAgentState) -> str:\n", - " \"\"\"Decide whether to use tools or end the conversation.\"\"\"\n", - " last_message = state[\"messages\"][-1]\n", - " # Check if the LLM wants to use tools\n", - " if hasattr(last_message, 'tool_calls') and last_message.tool_calls:\n", - " return \"tools\"\n", - " return END\n", - " \n", - " # Create the banking state graph\n", - " workflow = StateGraph(BankingAgentState)\n", - " # Add nodes\n", - " workflow.add_node(\"llm\", llm_node)\n", - " workflow.add_node(\"tools\", ToolNode(AVAILABLE_TOOLS))\n", - " # Simplified entry point - go directly to LLM\n", - " workflow.add_edge(START, \"llm\")\n", - " # From LLM, decide whether to use tools or end\n", - " workflow.add_conditional_edges(\n", - " \"llm\",\n", - " should_continue,\n", - " {\"tools\": \"tools\", END: END}\n", - " )\n", - " # Tool execution flows back to LLM for final response\n", - " workflow.add_edge(\"tools\", \"llm\")\n", - " # Set up memory\n", - " memory = MemorySaver()\n", - " # Compile the graph\n", - " agent = workflow.compile(checkpointer=memory)\n", - " return agent" - ] - }, - { - "cell_type": "markdown", - "id": "3eb40287", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Instantiate the banking agent\n", - "\n", - "Now, we'll create an instance of the banking agent by calling the workflow creation function.\n", - "\n", - "This compiled agent is ready to process banking requests and will automatically select and use the appropriate tools based on user queries:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "455b8ee4", - "metadata": {}, - "outputs": [], - "source": [ - "# Create the banking intelligent agent\n", - "banking_agent = create_banking_langgraph_agent()\n", - "\n", - "print(\"Banking LangGraph Agent Created Successfully!\")\n", - "print(\"\\nFeatures:\")\n", - "print(\" - Intelligent banking tool selection\")\n", - "print(\" - Comprehensive banking system prompt\")\n", - "print(\" - Streamlined workflow: LLM → Tools → Response\")\n", - "print(\" - Automatic tool parameter extraction\")\n", - "print(\" - Professional banking assistance\")" - ] - }, - { - "cell_type": "markdown", - "id": "12691528", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Integrate agent with ValidMind\n", - "\n", - "To integrate our LangGraph banking agent with ValidMind, we need to create a wrapper function that ValidMind can use to invoke the agent and extract the necessary information for testing and documentation, allowing ValidMind to run validation tests on the agent's behavior, tool usage, and responses." 
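- "\n", - "In effect, the wrapper is a small adapter with a fixed exchange format. A rough sketch of that contract is below; the field names follow the `banking_agent_fn` implemented in the next sections, while the values are dummies for illustration:\n", - "\n", - "```python\n", - "# Illustrative input/output contract for the agent wrapper (dummy values)\n", - "example_input = {\n", - "    \"input\": \"Check balance for checking account 12345\",  # user request text\n", - "    \"session_id\": \"session-001\",  # conversation thread id\n", - "}\n", - "example_output = {\n", - "    \"prediction\": \"Your checking account balance is ...\",  # final answer text\n", - "    \"output\": {\"messages\": []},  # full LangGraph result state\n", - "    \"tool_messages\": [\"concatenated tool outputs, used as RAGAS context\"],\n", - "    \"tool_called\": [],  # normalized tool-call list\n", - "}\n", - "```"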
- ] - }, - { - "cell_type": "markdown", - "id": "7b78509b", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Import ValidMind components\n", - "\n", - "We'll start with importing the necessary ValidMind components for integrating our agent:\n", - "\n", - "- `Prompt` from `validmind.models` for handling prompt-based model inputs\n", - "- `extract_tool_calls_from_agent_output` and `_convert_to_tool_call_list` from `validmind.scorers.llm.deepeval` for extracting and converting tool calls from agent outputs" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9aeb8969", - "metadata": {}, - "outputs": [], - "source": [ - "from validmind.models import Prompt\n", - "from validmind.scorers.llm.deepeval import extract_tool_calls_from_agent_output, _convert_to_tool_call_list" - ] - }, - { - "cell_type": "markdown", - "id": "f67f2955", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Create agent wrapper function\n", - "\n", - "We'll then create a wrapper function that:\n", - "\n", - "- Accepts input in ValidMind's expected format (with `input` and `session_id` fields)\n", - "- Invokes the banking agent with the proper state initialization\n", - "- Captures tool outputs and tool calls for evaluation\n", - "- Returns a standardized response format that includes the prediction, full output, tool messages, and tool call information\n", - "- Handles errors gracefully with fallback responses" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0e4d5a82", - "metadata": {}, - "outputs": [], - "source": [ - "def banking_agent_fn(input):\n", - " \"\"\"\n", - " Invoke the banking agent with the given input.\n", - " \"\"\"\n", - " try:\n", - " # Initial state for banking agent\n", - " initial_state = {\n", - " \"user_input\": input[\"input\"],\n", - " \"messages\": [HumanMessage(content=input[\"input\"])],\n", - " \"session_id\": input[\"session_id\"],\n", - " \"context\": {}\n", - " }\n", - " session_config = {\"configurable\": {\"thread_id\": input[\"session_id\"]}}\n", - " result = banking_agent.invoke(initial_state, config=session_config)\n", - "\n", - " from utils import capture_tool_output_messages\n", - "\n", - " # Capture all tool outputs and metadata\n", - " captured_data = capture_tool_output_messages(result)\n", - " \n", - " # Access specific tool outputs, this will be used for RAGAS tests\n", - " tool_message = \"\"\n", - " for output in captured_data[\"tool_outputs\"]:\n", - " tool_message += output['content']\n", - " \n", - " tool_calls_found = []\n", - " messages = result['messages']\n", - " for message in messages:\n", - " if hasattr(message, 'tool_calls') and message.tool_calls:\n", - " for tool_call in message.tool_calls:\n", - " # Handle both dictionary and object formats\n", - " if isinstance(tool_call, dict):\n", - " tool_calls_found.append(tool_call['name'])\n", - " else:\n", - " # ToolCall object - use attribute access\n", - " tool_calls_found.append(tool_call.name)\n", - "\n", - "\n", - " return {\n", - " \"prediction\": result['messages'][-1].content[0]['text'],\n", - " \"output\": result,\n", - " \"tool_messages\": [tool_message],\n", - " # \"tool_calls\": tool_calls_found,\n", - " \"tool_called\": _convert_to_tool_call_list(extract_tool_calls_from_agent_output(result))\n", - " }\n", - " except Exception as e:\n", - " # Return a fallback response if the agent fails\n", - " error_message = f\"\"\"I apologize, but I encountered an error while processing your banking request: {str(e)}.\n", - " Please try rephrasing your question or 
contact support if the issue persists.\"\"\"\n", - " return {\n", - " \"prediction\": error_message, \n", - " \"output\": {\n", - " \"messages\": [HumanMessage(content=input[\"input\"]), SystemMessage(content=error_message)],\n", - " \"error\": str(e)\n", - " }\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "4bdc90d6", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Initialize the ValidMind model object\n", - "\n", - "We'll also need to register the banking agent as a ValidMind model object (`vm_model`) that can be passed to other functions for analysis and tests on the data.\n", - "\n", - "You simply initialize this model object with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model) that:\n", - "\n", - "- Associates the wrapper function with the model for prediction\n", - "- Stores the system prompt template for documentation\n", - "- Provides a unique `input_id` for tracking and identification\n", - "- Enables the agent to be used with ValidMind's testing and documentation features" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "60a2ce7a", - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize the agent as a model\n", - "vm_banking_model = vm.init_model(\n", - " input_id=\"banking_agent_model\",\n", - " predict_fn=banking_agent_fn,\n", - " prompt=Prompt(template=system_context)\n", - ")" - ] - }, - { - "cell_type": "markdown", - "id": "33ed446a", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Store the agent reference\n", - "\n", - "We'll also store a reference to the original banking agent object in the ValidMind model. This allows us to access the full agent functionality directly if needed, while still maintaining the wrapper function interface for ValidMind's testing framework." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2c653471", - "metadata": {}, - "outputs": [], - "source": [ - "# Add the banking agent to the vm model\n", - "vm_banking_model.model = banking_agent" - ] - }, - { - "cell_type": "markdown", - "id": "bf44ea16", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Verify integration\n", - "\n", - "Let's confirm that the banking agent has been successfully integrated with ValidMind:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8e101b0f", - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Banking Agent Successfully Integrated with ValidMind!\")\n", - "print(f\"Model ID: {vm_banking_model.input_id}\")" - ] - }, - { - "cell_type": "markdown", - "id": "0c80518d", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Validate the system prompt\n", - "\n", - "Let's get an initial sense of how well our defined system prompt meets a few best practices for prompt engineering by running a few tests — we'll run evaluation tests later on our agent's performance.\n", - "\n", - "You run individual tests by calling [the `run_test` function](https://docs.validmind.ai/validmind/validmind/tests.html#run_test) provided by the `validmind.tests` module. 
Passing in our agentic model as an input, the tests below rate the prompt on a scale of 1-10 against the following criteria:\n", - "\n", - "- **[Clarity](https://docs.validmind.ai/tests/prompt_validation/Clarity.html)** — How clearly the prompt states the task.\n", - "- **[Conciseness](https://docs.validmind.ai/tests/prompt_validation/Conciseness.html)** — How succinctly the prompt states the task.\n", - "- **[Delimitation](https://docs.validmind.ai/tests/prompt_validation/Delimitation.html)** — When using complex prompts containing examples, contextual information, or other elements, is the prompt formatted in such a way that each element is clearly separated?\n", - "- **[NegativeInstruction](https://docs.validmind.ai/tests/prompt_validation/NegativeInstruction.html)** — Whether the prompt contains negative instructions.\n", - "- **[Specificity](https://docs.validmind.ai/tests/prompt_validation/Specificity.html)** — How specifically the prompt defines the task." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f52dceb1", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.prompt_validation.Clarity\",\n", - " inputs={\n", - " \"model\": vm_banking_model,\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "70d52333", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.prompt_validation.Conciseness\",\n", - " inputs={\n", - " \"model\": vm_banking_model,\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5aa89976", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.prompt_validation.Delimitation\",\n", - " inputs={\n", - " \"model\": vm_banking_model,\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8630197e", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.prompt_validation.NegativeInstruction\",\n", - " inputs={\n", - " \"model\": vm_banking_model,\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "bba99915", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.prompt_validation.Specificity\",\n", - " inputs={\n", - " \"model\": vm_banking_model,\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "id": "af4d6d77", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Initialize the ValidMind datasets\n", - "\n", - "After validating our system prompt, let's import our sample dataset ([banking_test_dataset.py](banking_test_dataset.py)), which we'll use in the next section to evaluate our agent's performance across different banking scenarios:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0c70ca2c", - "metadata": {}, - "outputs": [], - "source": [ - "from banking_test_dataset import banking_test_dataset" - ] - }, - { - "cell_type": "markdown", - "id": "0268ce6e", - "metadata": {}, - "source": [ - "The next step is to connect your data with a ValidMind `Dataset` object. **This step is always necessary every time you want to connect a dataset to documentation and produce test results through ValidMind,** but you only need to do it once per dataset.\n", - "\n", - "Initialize a ValidMind dataset object using the [`init_dataset` function](https://docs.validmind.ai/validmind/validmind.html#init_dataset) from the ValidMind (`vm`) module.
For this example, we'll pass in the following arguments:\n", - "\n", - "- **`input_id`** — A unique identifier that allows tracking what inputs are used when running each individual test.\n", - "- **`dataset`** — The raw dataset that you want to provide as input to tests.\n", - "- **`text_column`** — The name of the column containing the text input data.\n", - "- **`target_column`** — A required argument if tests require access to true values. This is the name of the target column in the dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "a7e9d158", - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset = vm.init_dataset(\n", - " input_id=\"banking_test_dataset\",\n", - " dataset=banking_test_dataset,\n", - " text_column=\"input\",\n", - " target_column=\"possible_outputs\",\n", - ")\n", - "\n", - "print(\"Banking Test Dataset Initialized in ValidMind!\")\n", - "print(f\"Dataset ID: {vm_test_dataset.input_id}\")\n", - "print(f\"Dataset columns: {vm_test_dataset._df.columns}\")\n", - "vm_test_dataset._df" - ] - }, - { - "cell_type": "markdown", - "id": "b9143fb6", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Assign predictions\n", - "\n", - "Now that both the model object and the dataset have been registered, we'll assign predictions to capture the banking agent's responses for evaluation:\n", - "\n", - "- The [`assign_predictions()` method](https://docs.validmind.ai/validmind/validmind/vm_models.html#assign_predictions) from the `Dataset` object can link existing predictions to any number of models.\n", - "- This method links the banking agent's responses to our `vm_test_dataset` dataset.\n", - "\n", - "If no prediction values are passed, the method will compute predictions automatically:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1d462663", - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_predictions(vm_banking_model)\n", - "\n", - "print(\"Banking Agent Predictions Generated Successfully!\")\n", - "print(f\"Predictions assigned to {len(vm_test_dataset._df)} test cases\")\n", - "vm_test_dataset._df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "8e50467e", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Running accuracy tests\n", - "\n", - "Using [`@vm.test`](https://docs.validmind.ai/validmind/validmind.html#test), let's implement some reusable custom *inline tests* to assess the accuracy of our banking agent:\n", - "\n", - "- An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.\n", - "- You'll note that the custom test functions are just regular Python functions that can include and require any Python library as you see fit." - ] - }, - { - "cell_type": "markdown", - "id": "6d8a9b90", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Response accuracy test\n", - "\n", - "We'll create a custom test that evaluates the banking agent's ability to provide accurate responses by:\n", - "\n", - "- Testing against a dataset of predefined banking questions and expected answers.\n", - "- Checking if responses contain expected keywords and banking terminology.\n", - "- Providing detailed test results including pass/fail status.\n", - "- Helping identify any gaps in the agent's banking knowledge or response quality."
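- "\n", - "The pass criterion is simple containment: a case passes if any expected keyword appears, case-insensitively, in the agent's response. A toy illustration of the check implemented below:\n", - "\n", - "```python\n", - "# Toy illustration of the keyword-containment pass criterion\n", - "response = \"Risk level: LOW. Your personal loan application is approved.\"\n", - "keywords = [\"low\", \"approved\"]\n", - "passed = any(str(k).lower() in response.lower() for k in keywords)\n", - "print(passed)  # True\n", - "```"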
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "90232066", - "metadata": {}, - "outputs": [], - "source": [ - "\n", - "@vm.test(\"my_custom_tests.banking_accuracy_test\")\n", - "def banking_accuracy_test(model, dataset, list_of_columns):\n", - "    \"\"\"\n", - "    The Banking Accuracy Test evaluates whether the agent’s responses include \n", - "    critical domain-specific keywords and phrases that indicate accurate, compliant,\n", - "    and contextually appropriate banking information. This test ensures that the agent\n", - "    provides responses containing the expected banking terminology, risk classifications,\n", - "    account details, or other domain-relevant information required for regulatory compliance,\n", - "    customer safety, and operational accuracy.\n", - "    \"\"\"\n", - "    df = dataset._df\n", - "    \n", - "    # Pre-compute responses for all tests\n", - "    y_true = dataset.y.tolist()\n", - "    y_pred = dataset.y_pred(model).tolist()\n", - "\n", - "    # Check each response for at least one of its expected keywords\n", - "    test_results = []\n", - "    for response, keywords in zip(y_pred, y_true):\n", - "        # Convert keywords to list if not already a list\n", - "        if not isinstance(keywords, list):\n", - "            keywords = [keywords]\n", - "        test_results.append(any(str(keyword).lower() in str(response).lower() for keyword in keywords))\n", - "    \n", - "    results = pd.DataFrame()\n", - "    column_names = [col + \"_details\" for col in list_of_columns]\n", - "    results[column_names] = df[list_of_columns]\n", - "    results[\"actual\"] = y_pred\n", - "    results[\"expected\"] = y_true\n", - "    results[\"passed\"] = test_results\n", - "    # Populate the error column per row so each failure carries its expected keywords\n", - "    results[\"error\"] = [\n", - "        None if passed else f'Response did not contain any expected keywords: {expected}'\n", - "        for passed, expected in zip(test_results, y_true)\n", - "    ]\n", - "    \n", - "    return results" - ] - }, - { - "cell_type": "markdown", - "id": "7eed5265", - "metadata": {}, - "source": [ - "Now that we've defined our custom response accuracy test, we can run it with the same `run_test()` function we used earlier to validate the system prompt, this time passing our sample dataset and agentic model as inputs, and log the test results to the ValidMind Platform with the [`log()` method](https://docs.validmind.ai/validmind/validmind/vm_models.html#log):" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e68884d5", - "metadata": {}, - "outputs": [], - "source": [ - "result = vm.tests.run_test(\n", - "    \"my_custom_tests.banking_accuracy_test\",\n", - "    inputs={\n", - "        \"dataset\": vm_test_dataset,\n", - "        \"model\": vm_banking_model\n", - "    },\n", - "    params={\n", - "        \"list_of_columns\": [\"input\"]\n", - "    }\n", - ")\n", - "result.log()" - ] - }, - { - "cell_type": "markdown", - "id": "4d758ddf", - "metadata": {}, - "source": [ - "Let's review the first five rows of the test dataset to see how well the banking agent performed. Each column in the output serves a specific purpose in evaluating agent performance:\n", - "\n", - "| Column header | Description | Importance |\n", - "|--------------|-------------|------------|\n", - "| **`input`** | Original user query or request | Essential for understanding the context of each test case and tracing which inputs led to specific agent behaviors. |\n", - "| **`expected_tools`** | Banking tools that should be invoked for this request | Enables validation of correct tool selection, which is critical for agentic AI systems where choosing the right tool is a key success metric. 
|\n", - "| **`expected_output`** | Expected output or keywords that should appear in the response | Defines the success criteria for each test case, enabling objective evaluation of whether the agent produced the correct result. |\n", - "| **`session_id`** | Unique identifier for each test session | Allows tracking and correlation of related test runs, debugging specific sessions, and maintaining audit trails. |\n", - "| **`category`** | Classification of the request type | Helps organize test results by domain and identify performance patterns across different banking use cases. |\n", - "| **`banking_agent_model_output`** | Complete agent response including all messages and reasoning | Allows you to examine the full output to assess response quality, completeness, and correctness beyond just keyword matching. |\n", - "| **`banking_agent_model_tool_messages`** | Messages exchanged with the banking tools | Critical for understanding how the agent interacted with tools, what parameters were passed, and what tool outputs were received. |\n", - "| **`banking_agent_model_tool_called`** | Specific tool that was invoked | Enables validation that the agent selected the correct tool for each request, which is fundamental to agentic AI validation. |\n", - "| **`possible_outputs`** | Alternative valid outputs or keywords that could appear in the response | Provides flexibility in evaluation by accounting for multiple acceptable response formats or variations. |" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "78f7edb1", - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.df.head(5)" - ] - }, - { - "cell_type": "markdown", - "id": "6f233bef", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Tool selection accuracy test\n", - "\n", - "We'll also create a custom test that evaluates the banking agent's ability to select the correct tools for different requests by:\n", - "\n", - "- Testing against a dataset of predefined banking queries with expected tool selections.\n", - "- Comparing the tools actually invoked by the agent against the expected tools for each request.\n", - "- Providing quantitative accuracy scores that measure the proportion of expected tools correctly selected.\n", - "- Helping identify gaps in the agent's understanding of user needs and tool selection logic." - ] - }, - { - "cell_type": "markdown", - "id": "d0b46111", - "metadata": {}, - "source": [ - "First, we'll define a helper function that extracts tool calls from the agent's messages and compares them against the expected tools. 
This function handles different message formats (dictionary or object) and calculates accuracy scores:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e68798be", - "metadata": {}, - "outputs": [], - "source": [ - "def validate_tool_calls_simple(messages, expected_tools):\n", - " \"\"\"Simple validation of tool calls without RAGAS dependency issues.\"\"\"\n", - " \n", - " tool_calls_found = []\n", - " \n", - " for message in messages:\n", - " if hasattr(message, 'tool_calls') and message.tool_calls:\n", - " for tool_call in message.tool_calls:\n", - " # Handle both dictionary and object formats\n", - " if isinstance(tool_call, dict):\n", - " tool_calls_found.append(tool_call['name'])\n", - " else:\n", - " # ToolCall object - use attribute access\n", - " tool_calls_found.append(tool_call.name)\n", - " \n", - " # Check if expected tools were called\n", - " accuracy = 0.0\n", - " matches = 0\n", - " if expected_tools:\n", - " matches = sum(1 for tool in expected_tools if tool in tool_calls_found)\n", - " accuracy = matches / len(expected_tools)\n", - " \n", - " return {\n", - " 'expected_tools': expected_tools,\n", - " 'found_tools': tool_calls_found,\n", - " 'matches': matches,\n", - " 'total_expected': len(expected_tools) if expected_tools else 0,\n", - " 'accuracy': accuracy,\n", - " }" - ] - }, - { - "cell_type": "markdown", - "id": "1b45472c", - "metadata": {}, - "source": [ - "Now we'll define the main test function that uses the helper function to evaluate tool selection accuracy across all test cases in the dataset:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "604d7313", - "metadata": {}, - "outputs": [], - "source": [ - "@vm.test(\"my_custom_tests.BankingToolCallAccuracy\")\n", - "def BankingToolCallAccuracy(dataset, agent_output_column, expected_tools_column):\n", - " \"\"\"\n", - " Evaluates the tool selection accuracy of a LangGraph-powered banking agent.\n", - "\n", - " This test measures whether the agent correctly identifies and invokes the required banking tools\n", - " for each user query scenario.\n", - " For each case, the outputs generated by the agent (including its tool calls) are compared against an\n", - " expected set of tools. The test considers both coverage and exactness: it computes the proportion of\n", - " expected tools correctly called by the agent for each instance.\n", - "\n", - " Parameters:\n", - " dataset (VMDataset): The dataset containing user queries, agent outputs, and ground-truth tool expectations.\n", - " agent_output_column (str): Dataset column name containing agent outputs (should include tool call details in 'messages').\n", - " expected_tools_column (str): Dataset column specifying the true expected tools (as lists).\n", - "\n", - " Returns:\n", - " List[dict]: Per-row dictionaries with details: expected tools, found tools, match count, total expected, and accuracy score.\n", - "\n", - " Purpose:\n", - " Provides diagnostic evidence of the banking agent's core reasoning ability—specifically, its capacity to\n", - " interpret user needs and select the correct banking actions. Useful for diagnosing gaps in tool coverage,\n", - " misclassifications, or breakdowns in agent logic.\n", - "\n", - " Interpretation:\n", - " - An accuracy of 1.0 signals perfect tool selection for that example.\n", - " - Lower scores may indicate partial or complete failures to invoke required tools.\n", - " - Review 'found_tools' vs. 
'expected_tools' to understand the source of discrepancies.\n", - "\n", - " Strengths:\n", - " - Directly tests a core capability of compositional tool-use agents.\n", - " - Framework-agnostic; robust to tool call output format (object or dict).\n", - " - Supports batch validation and result logging for systematic documentation.\n", - "\n", - " Limitations:\n", - " - Does not penalize extra, unnecessary tool calls.\n", - " - Does not assess result quality—only correct invocation.\n", - "\n", - " \"\"\"\n", - " df = dataset._df\n", - " \n", - " results = []\n", - " for i, row in df.iterrows():\n", - " result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])\n", - " results.append(result)\n", - " \n", - " return results" - ] - }, - { - "cell_type": "markdown", - "id": "d594c973", - "metadata": {}, - "source": [ - "Finally, we can call our function with `run_test()` and log the test results to the ValidMind Platform:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dd14115e", - "metadata": {}, - "outputs": [], - "source": [ - "result = vm.tests.run_test(\n", - " \"my_custom_tests.BankingToolCallAccuracy\",\n", - " inputs={\n", - " \"dataset\": vm_test_dataset,\n", - " },\n", - " params={\n", - " \"agent_output_column\": \"banking_agent_model_output\",\n", - " \"expected_tools_column\": \"expected_tools\"\n", - " }\n", - ")\n", - "result.log()" - ] - }, - { - "cell_type": "markdown", - "id": "f78f4107", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Assigning AI evaluation metric scores\n", - "\n", - "*AI agent evaluation metrics* are specialized measurements designed to assess how well autonomous LLM-based agents reason, plan, select and execute tools, and ultimately complete user tasks by analyzing the *full execution trace* — including reasoning steps, tool calls, intermediate decisions, and outcomes, rather than just single input–output pairs. These metrics are essential because agent failures often occur in ways traditional LLM metrics miss — for example, choosing the right tool with wrong arguments, creating a good plan but not following it, or completing a task inefficiently.\n", - "\n", - "In this section, we'll evaluate our banking agent's outputs and add scoring to our sample dataset against metrics defined in [DeepEval’s AI agent evaluation framework](https://deepeval.com/guides/guides-ai-agent-evaluation-metrics) which breaks down AI agent evaluation into three layers with corresponding subcategories: **reasoning**, **action**, and **execution**.\n", - "\n", - "Together, these three metrics enable granular diagnosis of agent behavior, help pinpoint where failures occur (reasoning, action, or execution), and support both development benchmarking and production monitoring." 
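 - ] - }, - { - "cell_type": "markdown", - "id": "c3e5a7b9", - "metadata": {}, - "source": [ - "To make the idea of a full execution trace concrete, here's a purely illustrative sketch of the kind of per-request record these metrics reason over. The field names are assumptions chosen for illustration, not a DeepEval or ValidMind schema:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d4f6b8c1", - "metadata": {}, - "outputs": [], - "source": [ - "# Illustrative only: the rough shape of an agent execution trace record.\n", - "# Field names are assumptions, not a DeepEval or ValidMind schema.\n", - "example_trace = {\n", - "    \"user_input\": \"Analyze credit risk for a $50k personal loan\",\n", - "    \"plan\": [\"classify the request\", \"call credit_risk_analyzer\", \"summarize the result\"],\n", - "    \"tool_calls\": [{\"name\": \"credit_risk_analyzer\", \"args\": {\"loan_amount\": 50000}}],\n", - "    \"final_output\": \"Based on the credit analysis, this application is low risk...\",\n", - "}\n", - "\n", - "# Reasoning metrics look at the plan, action metrics at the tool calls,\n", - "# and execution metrics at the final output versus the task goal.\n", - "print(list(example_trace))"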
- ] - }, - { - "cell_type": "markdown", - "id": "3a9c853a", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Identify relevant DeepEval scorers\n", - "\n", - "*Scorers* are evaluation metrics that analyze model outputs and store their results in the dataset:\n", - "\n", - "- Each scorer adds a new column to the dataset with format: `{scorer_name}_{metric_name}`\n", - "- The column contains the numeric score (typically `0`-`1`) for each example\n", - "- Multiple scorers can be run on the same dataset, each adding their own column\n", - "- Scores are persisted in the dataset for later analysis and visualization\n", - "- Common scorer patterns include:\n", - " - Model performance metrics (accuracy, F1, etc.)\n", - " - Output quality metrics (relevance, faithfulness)\n", - " - Task-specific metrics (completion, correctness)\n", - "\n", - "Use `list_scorers()` from [`validmind.scorers`](https://docs.validmind.ai/validmind/validmind/tests.html#scorer) to discover all available scoring methods and their IDs that can be used with `assign_scores()`. We'll filter these results to return only DeepEval scorers for our desired three metrics in a formatted table with descriptions:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "730c70ec", - "metadata": {}, - "outputs": [], - "source": [ - "# Load all DeepEval scorers\n", - "llm_scorers_dict = vm.tests.load._load_tests([s for s in vm.scorer.list_scorers() if \"deepeval\" in s.lower()])\n", - "\n", - "# Categorize scorers by metric layer\n", - "reasoning_scorers = {}\n", - "action_scorers = {}\n", - "execution_scorers = {}\n", - "\n", - "for scorer_id, scorer_func in llm_scorers_dict.items():\n", - " tags = getattr(scorer_func, \"__tags__\", [])\n", - " scorer_name = scorer_id.split(\".\")[-1]\n", - "\n", - " if \"reasoning_layer\" in tags:\n", - " reasoning_scorers[scorer_id] = scorer_func\n", - " elif \"action_layer\" in tags:\n", - " action_scorers[scorer_id] = scorer_func\n", - " elif \"TaskCompletion\" in scorer_name:\n", - " execution_scorers[scorer_id] = scorer_func\n", - "\n", - "# Display scorers by category\n", - "print(\"=\" * 80)\n", - "print(\"REASONING LAYER\")\n", - "print(\"=\" * 80)\n", - "if reasoning_scorers:\n", - " reasoning_df = vm.tests.load._pretty_list_tests(reasoning_scorers, truncate=True)\n", - " display(reasoning_df)\n", - "else:\n", - " print(\"No reasoning layer scorers found.\")\n", - "\n", - "print(\"\\n\" + \"=\" * 80)\n", - "print(\"ACTION LAYER\")\n", - "print(\"=\" * 80)\n", - "if action_scorers:\n", - " action_df = vm.tests.load._pretty_list_tests(action_scorers, truncate=True)\n", - " display(action_df)\n", - "else:\n", - " print(\"No action layer scorers found.\")\n", - "\n", - "print(\"\\n\" + \"=\" * 80)\n", - "print(\"EXECUTION LAYER\")\n", - "print(\"=\" * 80)\n", - "if execution_scorers:\n", - " execution_df = vm.tests.load._pretty_list_tests(execution_scorers, truncate=True)\n", - " display(execution_df)\n", - "else:\n", - " print(\"No execution layer scorers found.\")" - ] - }, - { - "cell_type": "markdown", - "id": "4dd73d0d", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Assign reasoning scores\n", - "\n", - "*Reasoning* evaluates planning and strategy generation:\n", - "\n", - "- **Plan quality** – How logical, complete, and efficient the agent’s plan is.\n", - "- **Plan adherence** – Whether the agent follows its own plan during execution." 
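 - ] - }, - { - "cell_type": "markdown", - "id": "e5a7b9c2", - "metadata": {}, - "source": [ - "As a quick sanity check on the column-naming convention described earlier, the small helper below can be re-run after any `assign_scores()` call in this section. It assumes the generated metric name is `score`, as with the `TaskCompletion_score` column shown later:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "f6b8c2d3", - "metadata": {}, - "outputs": [], - "source": [ - "# List score columns added so far; names follow {scorer_name}_{metric_name}.\n", - "score_cols = [c for c in vm_test_dataset._df.columns if c.endswith(\"_score\")]\n", - "print(f\"Score columns so far: {score_cols}\")\n", - "if score_cols:\n", - "    display(vm_test_dataset._df[score_cols].describe())"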
- ] - }, - { - "cell_type": "markdown", - "id": "06ccae28", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Plan quality score\n", - "\n", - "Let's measure how well our banking agent generates a plan before acting. A high score means the plan is logical, complete, and efficient." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "52f362ba", - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_scores(\n", - "    metrics = \"validmind.scorers.llm.deepeval.PlanQuality\",\n", - "    input_column = \"input\",\n", - "    actual_output_column = \"banking_agent_model_prediction\",\n", - "    tools_called_column = \"banking_agent_model_tool_called\",\n", - "    agent_output_column = \"banking_agent_model_output\",\n", - ")\n", - "vm_test_dataset._df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "8dcdc88f", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Plan adherence score\n", - "\n", - "Let's check whether our banking agent follows the plan it created. Deviations lower this score and indicate gaps between reasoning and execution." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4124a7c2", - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_scores(\n", - "    metrics = \"validmind.scorers.llm.deepeval.PlanAdherence\",\n", - "    input_column = \"input\",\n", - "    actual_output_column = \"banking_agent_model_prediction\",\n", - "    expected_output_column = \"expected_output\",\n", - "    tools_called_column = \"banking_agent_model_tool_called\",\n", - "    agent_output_column = \"banking_agent_model_output\",\n", - "\n", - ")\n", - "vm_test_dataset._df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "6da1ac95", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Assign action scores\n", - "\n", - "*Action* assesses tool usage and argument generation:\n", - "\n", - "- **Tool correctness** – Whether the agent selects and calls the right tools.\n", - "- **Argument correctness** – Whether the agent generates correct tool arguments." - ] - }, - { - "cell_type": "markdown", - "id": "d4db8270", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Tool correctness score\n", - "\n", - "Let's evaluate if our banking agent selects the appropriate tool for the task. Choosing the wrong tool reduces performance even if reasoning was correct." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "8d2e8a25", - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_scores(\n", - "    metrics = \"validmind.scorers.llm.deepeval.ToolCorrectness\",\n", - "    input_column = \"input\",\n", - "    actual_output_column = \"banking_agent_model_prediction\",\n", - "    tools_called_column = \"banking_agent_model_tool_called\",\n", - "    expected_tools_column = \"expected_tools\",\n", - "    agent_output_column = \"banking_agent_model_output\",\n", - "\n", - ")\n", - "vm_test_dataset._df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "9aa50b05", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Argument correctness score\n", - "\n", - "Let's assess whether our banking agent provides correct inputs or arguments to the selected tool. Incorrect arguments can lead to failed or unexpected results."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "04f90489", - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_scores(\n", - "    metrics = \"validmind.scorers.llm.deepeval.ArgumentCorrectness\",\n", - "    input_column = \"input\",\n", - "    actual_output_column = \"banking_agent_model_prediction\",\n", - "    tools_called_column = \"banking_agent_model_tool_called\",\n", - "    agent_output_column = \"banking_agent_model_output\",\n", - "\n", - ")\n", - "vm_test_dataset._df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "c59e5595", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Assign execution scores\n", - "\n", - "*Execution* measures end-to-end performance:\n", - "\n", - "- **Task completion** – Whether the agent successfully completes the intended task.\n" - ] - }, - { - "cell_type": "markdown", - "id": "d64600ca", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Task completion score\n", - "\n", - "Let's evaluate whether our banking agent successfully completes the requested tasks. Incomplete task execution can lead to user dissatisfaction and failed banking operations." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "05024f1f", - "metadata": {}, - "outputs": [], - "source": [ - "vm_test_dataset.assign_scores(\n", - "    metrics = \"validmind.scorers.llm.deepeval.TaskCompletion\",\n", - "    input_column = \"input\",\n", - "    actual_output_column = \"banking_agent_model_prediction\",\n", - "    agent_output_column = \"banking_agent_model_output\",\n", - "    tools_called_column = \"banking_agent_model_tool_called\",\n", - "\n", - ")\n", - "vm_test_dataset._df.head()" - ] - }, - { - "cell_type": "markdown", - "id": "21aa9b0d", - "metadata": {}, - "source": [ - "As you recall from the beginning of this section, when we run scorers through `assign_scores()`, the return values are automatically processed and added as new columns with the format `{scorer_name}_{metric_name}`. Note that the task completion scorer has added a new column `TaskCompletion_score` to our dataset.\n", - "\n", - "We'll use this column to visualize the distribution of task completion scores across our test cases through the [BoxPlot test](https://docs.validmind.ai/validmind/validmind/tests/plots/BoxPlot.html#boxplot):" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7f6d08ca", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - "    \"validmind.plots.BoxPlot\",\n", - "    inputs={\"dataset\": vm_test_dataset},\n", - "    params={\n", - "        \"columns\": \"TaskCompletion_score\",\n", - "        \"title\": \"Distribution of Task Completion Scores\",\n", - "        \"ylabel\": \"Score\",\n", - "        \"figsize\": (8, 6)\n", - "    }\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "id": "012bbcb8", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Running RAGAS tests\n", - "\n", - "Next, let's run some out-of-the-box *Retrieval-Augmented Generation Assessment* (RAGAS) tests available in the ValidMind Library. RAGAS provides specialized metrics for evaluating retrieval-augmented generation systems and conversational AI agents. These metrics analyze different aspects of agent performance by assessing how well systems integrate retrieved information with generated responses.\n", - "\n", - "Our banking agent uses tools to retrieve information and generates responses based on that context, making it similar to a RAG system. 
RAGAS metrics help evaluate the quality of this integration by analyzing the relationship between retrieved tool outputs, user queries, and generated responses.\n", - "\n", - "These tests provide insights into how well our banking agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to banking users while maintaining fidelity to retrieved information." - ] - }, - { - "cell_type": "markdown", - "id": "2036afba", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Identify relevant RAGAS tests\n", - "\n", - "Let's explore some of ValidMind's available tests. Using ValidMind's repository of tests streamlines your development testing and helps you ensure that your models are documented and evaluated appropriately.\n", - "\n", - "You can pass `task` and `tags` as parameters to the [`vm.tests.list_tests()` function](https://docs.validmind.ai/validmind/validmind/tests.html#list_tests) to filter the tests based on the tags and task types:\n", - "\n", - "- **`task`** represents the kind of modeling task associated with a test. Here we'll focus on `text_qa` tasks.\n", - "- **`tags`** are free-form descriptions providing more details about the test, for example, what category the test falls into. Here we'll focus on the `ragas` tag.\n", - "\n", - "We'll then run three of the returned tests as examples below." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "0701f5a9", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.list_tests(task=\"text_qa\", tags=[\"ragas\"])" - ] - }, - { - "cell_type": "markdown", - "id": "c1741ffc", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Faithfulness\n", - "\n", - "Let's evaluate whether the banking agent's responses accurately reflect the information retrieved from tools. Unfaithful responses can misreport credit analysis, financial calculations, and compliance results—undermining user trust in the banking agent." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "92044533", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - "    \"validmind.model_validation.ragas.Faithfulness\",\n", - "    inputs={\"dataset\": vm_test_dataset},\n", - "    param_grid={\n", - "        \"user_input_column\": [\"input\"],\n", - "        \"response_column\": [\"banking_agent_model_prediction\"],\n", - "        \"retrieved_contexts_column\": [\"banking_agent_model_tool_messages\"],\n", - "    },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "id": "42b71ccc", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Response Relevancy\n", - "\n", - "Let's evaluate whether the banking agent's answers address the user's original question or request. Irrelevant or off-topic responses can frustrate users and fail to deliver the banking information they need."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "d7483bc3", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.ResponseRelevancy\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " params={\n", - " \"user_input_column\": \"input\",\n", - " \"response_column\": \"banking_agent_model_prediction\",\n", - " \"retrieved_contexts_column\": \"banking_agent_model_tool_messages\",\n", - " }\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "id": "4f4d0569", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Context Recall\n", - "\n", - "Let's evaluate how well the banking agent uses the information retrieved from tools when generating its responses. Poor context recall can lead to incomplete or underinformed answers even when the right tools were selected." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "e5dc00ce", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.ContextRecall\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"retrieved_contexts_column\": [\"banking_agent_model_tool_messages\"],\n", - " \"reference_column\": [\"banking_agent_model_prediction\"],\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "id": "b987b00e", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Running safety tests\n", - "\n", - "Finally, let's run some out-of-the-box *safety* tests available in the ValidMind Library. Safety tests provide specialized metrics for evaluating whether AI agents operate reliably and securely. These metrics analyze different aspects of agent behavior by assessing adherence to safety guidelines, consistency of outputs, and resistance to harmful or inappropriate requests.\n", - "\n", - "Our banking agent handles sensitive financial information and user requests, making safety and reliability essential. Safety tests help evaluate whether the agent maintains appropriate boundaries, responds consistently and correctly to inputs, and avoids generating harmful, biased, or unprofessional content.\n", - "\n", - "These tests provide insights into how well our banking agent upholds standards of fairness and professionalism, ensuring it operates reliably and securely for banking users." - ] - }, - { - "cell_type": "markdown", - "id": "a754cca3", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### AspectCritic\n", - "\n", - "Let's evaluate our banking agent's responses across multiple quality dimensions — conciseness, coherence, correctness, harmfulness, and maliciousness. Weak performance on these dimensions can degrade user experience, fall short of professional banking standards, or introduce safety risks. 
\n", - "\n", - "We'll use the `AspectCritic` we identified earlier:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "148daa2b", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.model_validation.ragas.AspectCritic\",\n", - " inputs={\"dataset\": vm_test_dataset},\n", - " param_grid={\n", - " \"user_input_column\": [\"input\"],\n", - " \"response_column\": [\"banking_agent_model_prediction\"],\n", - " \"retrieved_contexts_column\": [\"banking_agent_model_tool_messages\"],\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "id": "92e5b1f6", - "metadata": {}, - "source": [ - "\n", - "\n", - "#### Bias\n", - "\n", - "Let's evaluate whether our banking agent's prompts contain unintended biases that could affect banking decisions. Biased prompts can lead to unfair or discriminatory outcomes — undermining customer trust and exposing the institution to compliance risk.\n", - "\n", - "We'll first use `list_tests()` again to filter for tests relating to `prompt_validation`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "74eba86c", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.list_tests(filter=\"prompt_validation\")" - ] - }, - { - "cell_type": "markdown", - "id": "bcc66b65", - "metadata": {}, - "source": [ - "And then run the identified `Bias` test:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "062cf8e7", - "metadata": {}, - "outputs": [], - "source": [ - "vm.tests.run_test(\n", - " \"validmind.prompt_validation.Bias\",\n", - " inputs={\n", - " \"model\": vm_banking_model,\n", - " },\n", - ").log()" - ] - }, - { - "cell_type": "markdown", - "id": "a2832750", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Next steps\n", - "\n", - "You can look at the output produced by the ValidMind Library right in the notebook where you ran the code, as you would expect. But there is a better way — use the ValidMind Platform to work with your model documentation." - ] - }, - { - "cell_type": "markdown", - "id": "a8cb1a58", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Work with your model documentation\n", - "\n", - "1. From the **Inventory** in the ValidMind Platform, go to the model you registered earlier. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/working-with-model-inventory.html))\n", - "\n", - "2. In the left sidebar that appears for your model, click **Documentation** under Documents.\n", - "\n", - " What you see is the full draft of your model documentation in a more easily consumable version. From here, you can make qualitative edits to model documentation, view guidelines, collaborate with validators, and submit your model documentation for approval when it's ready. [Learn more ...](https://docs.validmind.ai/guide/working-with-model-documentation.html)\n", - "\n", - "3. Click into any section related to the tests we ran in this notebook, for example: **4.3. Prompt Evaluation** to review the results of the tests we logged." - ] - }, - { - "cell_type": "markdown", - "id": "94ef26be", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Customize the banking agent for your use case\n", - "\n", - "You've now built an agentic AI system designed for banking use cases that supports compliance with supervisory guidance such as SR 11-7 and SS1/23, covering credit and fraud risk assessment for both retail and commercial banking. 
Extend this example agent to real-world banking scenarios and production deployment by:\n", - "\n", - "- Adapting the banking tools to your organization's specific requirements\n", - "- Adding more banking scenarios and edge cases to your test set\n", - "- Connecting the agent to your banking systems and databases\n", - "- Implementing additional banking-specific tools and workflows" - ] - }, - { - "cell_type": "markdown", - "id": "a681e49c", - "metadata": {}, - "source": [ - "\n", - "\n", - "### Discover more learning resources\n", - "\n", - "Learn more about the ValidMind Library tools we used in this notebook:\n", - "\n", - "- [Custom prompts](https://docs.validmind.ai/notebooks/how_to/tests/run_tests/configure_tests/customize_test_result_descriptions.html)\n", - "- [Custom tests](https://docs.validmind.ai/notebooks/how_to/tests/custom_tests/implement_custom_tests.html)\n", - "- [ValidMind scorers](https://docs.validmind.ai/notebooks/how_to/scoring/assign_scores_complete_tutorial.html)\n", - "\n", - "We also offer many more interactive notebooks to help you document models:\n", - "\n", - "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", - "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", - "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", - "\n", - "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." - ] - }, - { - "cell_type": "markdown", - "id": "707c1b6e", - "metadata": {}, - "source": [ - "\n", - "\n", - "## Upgrade ValidMind\n", - "\n", - "
After installing ValidMind, you’ll want to periodically make sure you are on the latest version to access any new features and other enhancements.
\n", - "\n", - "Retrieve the information for the currently installed version of ValidMind:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9733adff", - "metadata": {}, - "outputs": [], - "source": [ - "%pip show validmind" - ] - }, - { - "cell_type": "markdown", - "id": "e4b0b646", - "metadata": {}, - "source": [ - "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", - "\n", - "```bash\n", - "%pip install --upgrade validmind\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "387fa7f1", - "metadata": {}, - "source": [ - "You may need to restart your kernel after running the upgrade package for changes to be applied." - ] - }, - { - "cell_type": "markdown", - "id": "copyright-de4baf0f42ba4a37946d52586dff1049", - "metadata": {}, - "source": [ - "\n", - "\n", - "\n", - "\n", - "***\n", - "\n", - "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", - "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", - "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "validmind-1QuffXMV-py3.11", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 5 + "cells": [ + { + "cell_type": "markdown", + "id": "e7277c38", + "metadata": {}, + "source": [ + "# Document an agentic AI system\n", + "\n", + "Build and document an agentic AI system with the ValidMind Library. Construct a LangGraph-based banking agent, assign AI evaluation metric scores to your agent, and run accuracy, RAGAS, and safety tests, then log those test results to the ValidMind Platform.\n", + "\n", + "An _AI agent_ is an autonomous system that interprets inputs, selects from available tools or actions, and executes multi-step behaviors to achieve defined goals. In this notebook, the agent acts as a banking assistant that analyzes user requests and automatically selects and invokes the appropriate specialized banking tool to deliver accurate, compliant, and actionable responses.\n", + "\n", + "- This agent enables financial institutions to automate complex banking workflows where different customer requests require different specialized tools and knowledge bases.\n", + "- Effective validation of agentic AI systems reduces the risks of agents misinterpreting inputs, failing to extract required parameters, or producing incorrect assessments or actions — such as selecting the wrong tool.\n", + "\n", + "
For the LLM components in this notebook to function properly, you'll need access to OpenAI.\n", + "

\n", + "Before you continue, ensure that a valid OPENAI_API_KEY is set in your .env file.
" + ] + }, + { + "cell_type": "markdown", + "id": "a47dd942", + "metadata": {}, + "source": [ + "::: {.content-hidden when-format=\"html\"}\n", + "## Contents \n", + "- [About ValidMind](#toc1__) \n", + " - [Before you begin](#toc1_1__) \n", + " - [New to ValidMind?](#toc1_2__) \n", + " - [Key concepts](#toc1_3__) \n", + "- [Setting up](#toc2__) \n", + " - [Install the ValidMind Library](#toc2_1__) \n", + " - [Initialize the ValidMind Library](#toc2_2__) \n", + " - [Register sample model](#toc2_2_1__) \n", + " - [Apply documentation template](#toc2_2_2__) \n", + " - [Get your code snippet](#toc2_2_3__) \n", + " - [Preview the documentation template](#toc2_2_4__) \n", + " - [Verify OpenAI API access](#toc2_3__) \n", + " - [Initialize the Python environment](#toc2_4__) \n", + "- [Building the LangGraph agent](#toc3__) \n", + " - [Test available banking tools](#toc3_1__) \n", + " - [Create LangGraph banking agent](#toc3_2__) \n", + " - [Define system prompt](#toc3_2_1__) \n", + " - [Initialize the LLM](#toc3_2_2__) \n", + " - [Define agent state structure](#toc3_2_3__) \n", + " - [Create agent workflow function](#toc3_2_4__) \n", + " - [Instantiate the banking agent](#toc3_2_5__) \n", + " - [Integrate agent with ValidMind](#toc3_3__) \n", + " - [Import ValidMind components](#toc3_3_1__) \n", + " - [Create agent wrapper function](#toc3_3_2__) \n", + " - [Initialize the ValidMind model object](#toc3_3_3__) \n", + " - [Store the agent reference](#toc3_3_4__) \n", + " - [Verify integration](#toc3_3_5__) \n", + " - [Validate the system prompt](#toc3_4__) \n", + "- [Initialize the ValidMind datasets](#toc4__) \n", + " - [Assign predictions](#toc4_1__) \n", + "- [Running accuracy tests](#toc5__) \n", + " - [Response accuracy test](#toc5_1__) \n", + " - [Tool selection accuracy test](#toc5_2__) \n", + "- [Assigning AI evaluation metric scores](#toc6__) \n", + " - [Identify relevant DeepEval scorers](#toc6_1__) \n", + " - [Assign reasoning scores](#toc6_2__) \n", + " - [Plan quality score](#toc6_2_1__) \n", + " - [Plan adherence score](#toc6_2_2__) \n", + " - [Assign action scores](#toc6_3__) \n", + " - [Tool correctness score](#toc6_3_1__) \n", + " - [Argument correctness score](#toc6_3_2__) \n", + " - [Assign execution scores](#toc6_4__) \n", + " - [Task completion score](#toc6_4_1__) \n", + "- [Running RAGAS tests](#toc7__) \n", + " - [Identify relevant RAGAS tests](#toc7_1__) \n", + " - [Faithfulness](#toc7_1_1__) \n", + " - [Response Relevancy](#toc7_1_2__) \n", + " - [Context Recall](#toc7_1_3__) \n", + "- [Running safety tests](#toc8__) \n", + " - [AspectCritic](#toc8_1_1__) \n", + " - [Bias](#toc8_1_2__) \n", + "- [Next steps](#toc9__) \n", + " - [Work with your model documentation](#toc9_1__) \n", + " - [Customize the banking agent for your use case](#toc9_2__) \n", + " - [Discover more learning resources](#toc9_3__) \n", + "- [Upgrade ValidMind](#toc10__) \n", + "\n", + ":::\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "ecaad35f", + "metadata": {}, + "source": [ + "\n", + "\n", + "## About ValidMind\n", + "\n", + "ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models. \n", + "\n", + "You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. 
Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators." + ] + }, + { + "cell_type": "markdown", + "id": "6ff1f9ef", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Before you begin\n", + "\n", + "This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language. \n", + "\n", + "If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html)." + ] + }, + { + "cell_type": "markdown", + "id": "d7ad8d8c", + "metadata": {}, + "source": [ + "\n", + "\n", + "### New to ValidMind?\n", + "\n", + "If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.\n", + "\n", + "
For access to all features available in this notebook, you'll need access to a ValidMind account.\n", + "

\n", + "Register with ValidMind
" + ] + }, + { + "cell_type": "markdown", + "id": "323caa59", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Key concepts\n", + "\n", + "**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.\n", + "\n", + "**Documentation template**: Functions as a test suite and lays out the structure of model documentation, segmented into various sections and sub-sections. Documentation templates define the structure of your model documentation, specifying the tests that should be run, and how the results should be displayed.\n", + "\n", + "**Tests**: A function contained in the ValidMind Library, designed to run a specific quantitative test on the dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.\n", + "\n", + "**Metrics**: A subset of tests that do not have thresholds. In the context of this notebook, metrics and tests can be thought of as interchangeable concepts.\n", + "\n", + "**Custom metrics**: Custom metrics are functions that you define to evaluate your model or dataset. These functions can be registered with the ValidMind Library to be used in the ValidMind Platform.\n", + "\n", + "**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:\n", + "\n", + " - **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).\n", + " - **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).\n", + " - **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom metric.\n", + " - **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom metric. (Learn more: [Run tests with multiple datasets](https://docs.validmind.ai/notebooks/how_to/run_tests_that_require_multiple_datasets.html))\n", + "\n", + "**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a metric, customize its behavior, or provide additional context.\n", + "\n", + "**Outputs**: Custom metrics can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.\n", + "\n", + "**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.\n", + "\n", + "Example: the [`classifier_full_suite`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html#ClassifierFullSuite) test suite runs tests from the [`tabular_dataset`](https://docs.validmind.ai/validmind/validmind/test_suites/tabular_datasets.html) and [`classifier`](https://docs.validmind.ai/validmind/validmind/test_suites/classifier.html) test suites to fully document the data and model sections for binary classification model use-cases." 
+ ] + }, + { + "cell_type": "markdown", + "id": "ddba5169", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Setting up" + ] + }, + { + "cell_type": "markdown", + "id": "b53da99c", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Install the ValidMind Library\n", + "\n", + "
Recommended Python versions\n", + "

\n", + "Python 3.9 <= x <= 3.11
\n", + "\n", + "Let's begin by installing the ValidMind Library with large language model (LLM) support:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1982a118", + "metadata": {}, + "outputs": [], + "source": [ + "%pip install -q \"validmind[llm]\" \"langgraph==0.3.21\"" + ] + }, + { + "cell_type": "markdown", + "id": "dc9dea3a", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Initialize the ValidMind Library" + ] + }, + { + "cell_type": "markdown", + "id": "5848461e", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Register sample model\n", + "\n", + "Let's first register a sample model for use with this notebook.\n", + "\n", + "1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).\n", + "\n", + "2. In the left sidebar, navigate to **Inventory** and click **+ Register Model**.\n", + "\n", + "3. Enter the model details and click **Next >** to continue to assignment of model stakeholders. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))\n", + "\n", + "4. Select your own name under the **MODEL OWNER** drop-down.\n", + "\n", + "5. Click **Register Model** to add the model to your inventory." + ] + }, + { + "cell_type": "markdown", + "id": "97d0b04b", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Apply documentation template\n", + "\n", + "Once you've registered your model, let's select a documentation template. A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.\n", + "\n", + "1. In the left sidebar that appears for your model, click **Documents** and select **Documentation**.\n", + "\n", + "2. Under **TEMPLATE**, select `Agentic AI`.\n", + "\n", + "3. Click **Use Template** to apply the template." + ] + }, + { + "cell_type": "markdown", + "id": "b279d5fa", + "metadata": {}, + "source": [ + "
Can't select this template?\n", + "

\n", + "Your organization administrators may need to add it to your template library:\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "3606cb8c", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Get your code snippet\n", + "\n", + "ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.\n", + "\n", + "1. On the left sidebar that appears for your model, select **Getting Started** and click **Copy snippet to clipboard**.\n", + "2. Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6ccbefc", + "metadata": {}, + "outputs": [], + "source": [ + "# Load your model identifier credentials from an `.env` file\n", + "\n", + "%load_ext dotenv\n", + "%dotenv .env\n", + "\n", + "# Or replace with your code snippet\n", + "\n", + "import validmind as vm\n", + "\n", + "vm.init(\n", + " # api_host=\"...\",\n", + " # api_key=\"...\",\n", + " # api_secret=\"...\",\n", + " # model=\"...\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2ed79cf0", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Preview the documentation template\n", + "\n", + "Let's verify that you have connected the ValidMind Library to the ValidMind Platform and that the appropriate *template* is selected for your model.\n", + "\n", + "You will upload documentation and test results unique to your model based on this template later on. For now, **take a look at the default structure that the template provides with [the `vm.preview_template()` function](https://docs.validmind.ai/validmind/validmind.html#preview_template)** from the ValidMind library and note the empty sections:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dffdaa6f", + "metadata": {}, + "outputs": [], + "source": [ + "vm.preview_template()" + ] + }, + { + "cell_type": "markdown", + "id": "b5c5ba68", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Verify OpenAI API access\n", + "\n", + "Verify that a valid `OPENAI_API_KEY` is set in your `.env` file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22cc39cb", + "metadata": {}, + "outputs": [], + "source": [ + "# Load environment variables if using .env file\n", + "try:\n", + " from dotenv import load_dotenv\n", + " load_dotenv()\n", + "except ImportError:\n", + " print(\"dotenv not installed. Make sure OPENAI_API_KEY is set in your environment.\")" + ] + }, + { + "cell_type": "markdown", + "id": "e4a9d3a9", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Initialize the Python environment\n", + "\n", + "Let's import all the necessary libraries to prepare for building our banking LangGraph agentic system:\n", + "\n", + "- **Standard libraries** for data handling and environment management.\n", + "- **pandas**, a Python library for data manipulation and analytics, as an alias. We'll also configure pandas to show all columns and all rows at full width for easier debugging and inspection.\n", + "- **LangChain** components for LLM integration and tool management.\n", + "- **LangGraph** for building stateful, multi-step agent workflows.\n", + "- **Banking tools** for specialized financial services as defined in [banking_tools.py](banking_tools.py)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2058d1ac", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import TypedDict, Annotated, Sequence\n", + "\n", + "from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage\n", + "from langchain_openai import ChatOpenAI\n", + "from langgraph.checkpoint.memory import MemorySaver\n", + "from langgraph.graph import StateGraph, END, START\n", + "from langgraph.graph.message import add_messages\n", + "from langgraph.prebuilt import ToolNode\n", + "\n", + "# LOCAL IMPORTS FROM banking_tools.py\n", + "from banking_tools import AVAILABLE_TOOLS\n", + "\n", + "import pandas as pd\n", + "# Configure pandas to show all columns and all rows at full width\n", + "pd.set_option('display.max_columns', None)\n", + "pd.set_option('display.max_colwidth', None)\n", + "pd.set_option('display.width', None)\n", + "pd.set_option('display.max_rows', None)" + ] + }, + { + "cell_type": "markdown", + "id": "e109d075", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Building the LangGraph agent" + ] + }, + { + "cell_type": "markdown", + "id": "15040411", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Test available banking tools\n", + "\n", + "We'll use the demo banking tools defined in `banking_tools.py` that provide use cases of financial services:\n", + "\n", + "- **Credit Risk Analyzer** - Loan applications and credit decisions\n", + "- **Customer Account Manager** - Account services and customer support\n", + "- **Fraud Detection System** - Security and fraud prevention" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e0a120c", + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"Available tools: {len(AVAILABLE_TOOLS)}\")\n", + "print(\"\\nTool Details:\")\n", + "for i, tool in enumerate(AVAILABLE_TOOLS, 1):\n", + " print(f\" - {tool.name}\")" + ] + }, + { + "cell_type": "markdown", + "id": "04d6785a", + "metadata": {}, + "source": [ + "Let's test each banking tool individually to ensure they're working correctly before integrating them into our agent:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc0caff2", + "metadata": {}, + "outputs": [], + "source": [ + "# Test 1: Credit Risk Analyzer\n", + "print(\"TEST 1: Credit Risk Analyzer\")\n", + "print(\"-\" * 40)\n", + "try:\n", + " # Access the underlying function using .func\n", + " credit_result = AVAILABLE_TOOLS[0].func(\n", + " customer_income=75000,\n", + " customer_debt=1200,\n", + " credit_score=720,\n", + " loan_amount=50000,\n", + " loan_type=\"personal\"\n", + " )\n", + " print(credit_result)\n", + " print(\"Credit Risk Analyzer test PASSED\")\n", + "except Exception as e:\n", + " print(f\"Credit Risk Analyzer test FAILED: {e}\")\n", + "\n", + "print(\"\" + \"=\" * 60)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b6b227db", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Test 2: Customer Account Manager\n", + "print(\"TEST 2: Customer Account Manager\")\n", + "print(\"-\" * 40)\n", + "try:\n", + " # Test checking balance\n", + " account_result = AVAILABLE_TOOLS[1].func(\n", + " account_type=\"checking\",\n", + " customer_id=\"12345\",\n", + " action=\"check_balance\"\n", + " )\n", + " print(account_result)\n", + "\n", + " # Test getting account info\n", + " info_result = AVAILABLE_TOOLS[1].func(\n", + " account_type=\"all\",\n", + " customer_id=\"12345\", \n", + " action=\"get_info\"\n", + " )\n", + " print(info_result)\n", + " 
print(\"Customer Account Manager test PASSED\")\n", + "except Exception as e:\n", + " print(f\"Customer Account Manager test FAILED: {e}\")\n", + "\n", + "print(\"\" + \"=\" * 60)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a983b30d", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Test 3: Fraud Detection System\n", + "print(\"TEST 3: Fraud Detection System\")\n", + "print(\"-\" * 40)\n", + "try:\n", + " fraud_result = AVAILABLE_TOOLS[2].func(\n", + " transaction_id=\"TX123\",\n", + " customer_id=\"12345\",\n", + " transaction_amount=500.00,\n", + " transaction_type=\"withdrawal\",\n", + " location=\"Miami, FL\",\n", + " device_id=\"DEVICE_001\"\n", + " )\n", + " print(fraud_result)\n", + " print(\"Fraud Detection System test PASSED\")\n", + "except Exception as e:\n", + " print(f\"Fraud Detection System test FAILED: {e}\")\n", + "\n", + "print(\"\" + \"=\" * 60)" + ] + }, + { + "cell_type": "markdown", + "id": "6bf04845", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Create LangGraph banking agent\n", + "\n", + "With our tools ready to go, we'll create our intelligent banking agent with LangGraph that automatically selects and uses the appropriate banking tool based on a user request." + ] + }, + { + "cell_type": "markdown", + "id": "31df57f0", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Define system prompt\n", + "\n", + "We'll begin by defining our system prompt, which provides the LLM with context about its role as a banking assistant and guidance on when to use each available tool:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7971c427", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Enhanced banking system prompt with tool selection guidance\n", + "system_context = \"\"\"You are a professional banking AI assistant with access to specialized banking tools.\n", + " Analyze the user's banking request and directly use the most appropriate tools to help them.\n", + " \n", + " AVAILABLE BANKING TOOLS:\n", + " \n", + " credit_risk_analyzer - Analyze credit risk for loan applications and credit decisions\n", + " - Use for: loan applications, credit assessments, risk analysis, mortgage eligibility\n", + " - Examples: \"Analyze credit risk for $50k personal loan\", \"Assess mortgage eligibility for $300k home purchase\"\n", + " - Parameters: customer_income, customer_debt, credit_score, loan_amount, loan_type\n", + "\n", + " customer_account_manager - Manage customer accounts and provide banking services\n", + " - Use for: account information, transaction processing, product recommendations, customer service\n", + " - Examples: \"Check balance for checking account 12345\", \"Recommend products for customer with high balance\"\n", + " - Parameters: account_type, customer_id, action, amount, account_details\n", + "\n", + " fraud_detection_system - Analyze transactions for potential fraud and security risks\n", + " - Use for: transaction monitoring, fraud prevention, risk assessment, security alerts\n", + " - Examples: \"Analyze fraud risk for $500 ATM withdrawal in Miami\", \"Check security for $2000 online purchase\"\n", + " - Parameters: transaction_id, customer_id, transaction_amount, transaction_type, location, device_id\n", + "\n", + " BANKING INSTRUCTIONS:\n", + " - Analyze the user's banking request carefully and identify the primary need\n", + " - If they need credit analysis → use credit_risk_analyzer\n", + " - If they need financial calculations → use financial_calculator\n", + " - 
If they need account services → use customer_account_manager\n", + " - If they need security analysis → use fraud_detection_system\n", + " - Extract relevant parameters from the user's request\n", + " - Provide helpful, accurate banking responses based on tool outputs\n", + " - Always consider banking regulations, risk management, and best practices\n", + " - Be professional and thorough in your analysis\n", + "\n", + " Choose and use tools wisely to provide the most helpful banking assistance.\n", + " Describe the response in user friendly manner with details describing the tool output. \n", + " Provide the response in at least 500 words.\n", + " Generate a concise execution plan for the banking request.\n", + " \"\"\"" + ] + }, + { + "cell_type": "markdown", + "id": "406835c8", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Initialize the LLM\n", + "\n", + "Let's initialize the LLM that will power our banking agent:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "866066e7", + "metadata": {}, + "outputs": [], + "source": [ + "# Initialize the main LLM for banking responses\n", + "main_llm = ChatOpenAI(\n", + " model=\"gpt-5-mini\",\n", + " reasoning={\n", + " \"effort\": \"low\",\n", + " \"summary\": \"auto\"\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "cce9685c", + "metadata": {}, + "source": [ + "Then bind the available banking tools to the LLM, enabling the model to automatically recognize and invoke each tool when appropriate based on request input and the system prompt we defined above:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "906d8132", + "metadata": {}, + "outputs": [], + "source": [ + "# Bind all banking tools to the main LLM\n", + "llm_with_tools = main_llm.bind_tools(AVAILABLE_TOOLS)" + ] + }, + { + "cell_type": "markdown", + "id": "2bad8799", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Define agent state structure\n", + "\n", + "The agent state defines the data structure that flows through the LangGraph workflow. It includes:\n", + "\n", + "- **messages** — The conversation history between the user and agent\n", + "- **user_input** — The current user request\n", + "- **session_id** — A unique identifier for the conversation session\n", + "- **context** — Additional context that can be passed between nodes\n", + "\n", + "Defining this state structure maintains the structure throughout the agent's execution and allows for multi-turn conversations with memory:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b926ddf", + "metadata": {}, + "outputs": [], + "source": [ + "# Banking Agent State Definition\n", + "class BankingAgentState(TypedDict):\n", + " messages: Annotated[Sequence[BaseMessage], add_messages]\n", + " user_input: str\n", + " session_id: str\n", + " context: dict" + ] + }, + { + "cell_type": "markdown", + "id": "47ce81b7", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Create agent workflow function\n", + "\n", + "We'll build the LangGraph agent workflow with two main components:\n", + "\n", + "1. **LLM node** — Processes user requests, applies the system prompt, and decides whether to use tools.\n", + "2. **Tools node** — Executes the selected banking tools when the LLM determines they're needed.\n", + "\n", + "The workflow begins with the LLM analyzing the request, then uses tools if needed — or ends if the response is complete, and finally returns to the LLM to generate the final response." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2c9bf585", + "metadata": {}, + "outputs": [], + "source": [ + "def create_banking_langgraph_agent():\n", + " \"\"\"Create a comprehensive LangGraph banking agent with intelligent tool selection.\"\"\"\n", + " def llm_node(state: BankingAgentState) -> BankingAgentState:\n", + " \"\"\"Main LLM node that processes banking requests and selects appropriate tools.\"\"\"\n", + " messages = state[\"messages\"]\n", + " # Add system context to messages\n", + " enhanced_messages = [SystemMessage(content=system_context)] + list(messages)\n", + " # Get LLM response with tool selection\n", + " response = llm_with_tools.invoke(enhanced_messages)\n", + " return {\n", + " **state,\n", + " \"messages\": messages + [response]\n", + " }\n", + " \n", + " def should_continue(state: BankingAgentState) -> str:\n", + " \"\"\"Decide whether to use tools or end the conversation.\"\"\"\n", + " last_message = state[\"messages\"][-1]\n", + " # Check if the LLM wants to use tools\n", + " if hasattr(last_message, 'tool_calls') and last_message.tool_calls:\n", + " return \"tools\"\n", + " return END\n", + " \n", + " # Create the banking state graph\n", + " workflow = StateGraph(BankingAgentState)\n", + " # Add nodes\n", + " workflow.add_node(\"llm\", llm_node)\n", + " workflow.add_node(\"tools\", ToolNode(AVAILABLE_TOOLS))\n", + " # Simplified entry point - go directly to LLM\n", + " workflow.add_edge(START, \"llm\")\n", + " # From LLM, decide whether to use tools or end\n", + " workflow.add_conditional_edges(\n", + " \"llm\",\n", + " should_continue,\n", + " {\"tools\": \"tools\", END: END}\n", + " )\n", + " # Tool execution flows back to LLM for final response\n", + " workflow.add_edge(\"tools\", \"llm\")\n", + " # Set up memory\n", + " memory = MemorySaver()\n", + " # Compile the graph\n", + " agent = workflow.compile(checkpointer=memory)\n", + " return agent" + ] + }, + { + "cell_type": "markdown", + "id": "3eb40287", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Instantiate the banking agent\n", + "\n", + "Now, we'll create an instance of the banking agent by calling the workflow creation function.\n", + "\n", + "This compiled agent is ready to process banking requests and will automatically select and use the appropriate tools based on user queries:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "455b8ee4", + "metadata": {}, + "outputs": [], + "source": [ + "# Create the banking intelligent agent\n", + "banking_agent = create_banking_langgraph_agent()\n", + "\n", + "print(\"Banking LangGraph Agent Created Successfully!\")\n", + "print(\"\\nFeatures:\")\n", + "print(\" - Intelligent banking tool selection\")\n", + "print(\" - Comprehensive banking system prompt\")\n", + "print(\" - Streamlined workflow: LLM → Tools → Response\")\n", + "print(\" - Automatic tool parameter extraction\")\n", + "print(\" - Professional banking assistance\")" + ] + }, + { + "cell_type": "markdown", + "id": "12691528", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Integrate agent with ValidMind\n", + "\n", + "To integrate our LangGraph banking agent with ValidMind, we need to create a wrapper function that ValidMind can use to invoke the agent and extract the necessary information for testing and documentation, allowing ValidMind to run validation tests on the agent's behavior, tool usage, and responses." 
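+ "\n",
+ "Before wrapping the agent, it can help to see what a raw invocation looks like. A minimal sketch (the query and session values here are hypothetical placeholders):\n",
+ "\n",
+ "```python\n",
+ "# Minimal sketch: invoke the compiled agent directly (hypothetical values)\n",
+ "state = {\n",
+ "    \"user_input\": \"Check balance for checking account 12345\",\n",
+ "    \"messages\": [HumanMessage(content=\"Check balance for checking account 12345\")],\n",
+ "    \"session_id\": \"demo-session-1\",\n",
+ "    \"context\": {},\n",
+ "}\n",
+ "# The MemorySaver checkpointer requires a thread_id in the config\n",
+ "config = {\"configurable\": {\"thread_id\": \"demo-session-1\"}}\n",
+ "result = banking_agent.invoke(state, config=config)\n",
+ "print(result[\"messages\"][-1].content)\n",
+ "```"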
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7b78509b",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Import ValidMind components\n",
+ "\n",
+ "We'll start by importing the components needed to integrate our agent:\n",
+ "\n",
+ "- `Prompt` from `validmind.models` for handling prompt-based model inputs\n",
+ "- `extract_tool_calls_from_agent_output` and `_convert_to_tool_call_list` from `validmind.scorers.llm.deepeval` for extracting and converting tool calls from agent outputs\n",
+ "- `observe` and `update_current_span` from `deepeval.tracing`, along with `LLMTestCase` from `deepeval.test_case`, for feeding trace data to DeepEval's tracing-based metrics"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9aeb8969",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from validmind.models import Prompt\n",
+ "from validmind.scorers.llm.deepeval import extract_tool_calls_from_agent_output, _convert_to_tool_call_list\n",
+ "from deepeval.tracing import observe, update_current_span\n",
+ "from deepeval.test_case import LLMTestCase"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f67f2955",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Create agent wrapper function\n",
+ "\n",
+ "We'll then create a wrapper function that:\n",
+ "\n",
+ "- Accepts input in ValidMind's expected format (with `input` and `session_id` fields)\n",
+ "- Invokes the banking agent with the proper state initialization\n",
+ "- Captures tool outputs and tool calls for evaluation\n",
+ "- Returns a standardized response format that includes the prediction, full output, tool messages, and tool call information\n",
+ "- Handles errors gracefully with fallback responses"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0e4d5a82",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "@observe(type=\"agent\")\n",
+ "def banking_agent_fn(input):\n",
+ "    \"\"\"\n",
+ "    Invoke the banking agent with the given input.\n",
+ "    \"\"\"\n",
+ "    try:\n",
+ "        # Initial state for banking agent\n",
+ "        initial_state = {\n",
+ "            \"user_input\": input[\"input\"],\n",
+ "            \"messages\": [HumanMessage(content=input[\"input\"])],\n",
+ "            \"session_id\": input[\"session_id\"],\n",
+ "            \"context\": {}\n",
+ "        }\n",
+ "        session_config = {\"configurable\": {\"thread_id\": input[\"session_id\"]}}\n",
+ "        result = banking_agent.invoke(initial_state, config=session_config)\n",
+ "\n",
+ "        from utils import capture_tool_output_messages\n",
+ "\n",
+ "        # Capture all tool outputs and metadata\n",
+ "        captured_data = capture_tool_output_messages(result)\n",
+ "\n",
+ "        # Concatenate tool outputs; these will be used for the RAGAS tests\n",
+ "        tool_message = \"\"\n",
+ "        for output in captured_data[\"tool_outputs\"]:\n",
+ "            tool_message += output['content']\n",
+ "\n",
+ "        tool_calls_found = []\n",
+ "        messages = result['messages']\n",
+ "        for message in messages:\n",
+ "            if hasattr(message, 'tool_calls') and message.tool_calls:\n",
+ "                for tool_call in message.tool_calls:\n",
+ "                    # Handle both dictionary and object formats\n",
+ "                    if isinstance(tool_call, dict):\n",
+ "                        tool_calls_found.append(tool_call['name'])\n",
+ "                    else:\n",
+ "                        # ToolCall object - use attribute access\n",
+ "                        tool_calls_found.append(tool_call.name)\n",
+ "\n",
+ "        prediction_text = result['messages'][-1].content[0]['text']\n",
+ "        tools_called_value = _convert_to_tool_call_list(extract_tool_calls_from_agent_output(result))\n",
+ "        expected_tools_value = _convert_to_tool_call_list(input.get(\"expected_tools\", []))\n",
+ "\n",
+ "        # Feed trace data for DeepEval metrics (e.g. PlanQuality) that require tracing\n",
+ "        update_current_span(\n",
+ "            test_case=LLMTestCase(\n",
+ "                input=input[\"input\"],\n",
+ "                actual_output=prediction_text,\n",
+ "                tools_called=tools_called_value,\n",
+ "                expected_tools=expected_tools_value\n",
+ "            )\n",
+ "        )\n",
+ "\n",
+ "        return {\n",
+ "            \"prediction\": prediction_text,\n",
+ "            \"output\": result,\n",
+ "            \"tool_messages\": [tool_message],\n",
+ "            \"tool_called\": tools_called_value\n",
+ "        }\n",
+ "    except Exception as e:\n",
+ "        # Return a fallback response if the agent fails\n",
+ "        error_message = f\"\"\"I apologize, but I encountered an error while processing your banking request: {str(e)}.\n",
+ "        Please try rephrasing your question or contact support if the issue persists.\"\"\"\n",
+ "        # Keep the same keys as the success path so downstream code can rely on them\n",
+ "        return {\n",
+ "            \"prediction\": error_message,\n",
+ "            \"output\": {\n",
+ "                \"messages\": [HumanMessage(content=input[\"input\"]), SystemMessage(content=error_message)],\n",
+ "                \"error\": str(e)\n",
+ "            },\n",
+ "            \"tool_messages\": [\"\"],\n",
+ "            \"tool_called\": []\n",
+ "        }"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4bdc90d6",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Initialize the ValidMind model object\n",
+ "\n",
+ "We'll also need to register the banking agent as a ValidMind model object (`vm_model`) that can be passed to other functions for analysis and tests on the data.\n",
+ "\n",
+ "You initialize this model object with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model), which:\n",
+ "\n",
+ "- Associates the wrapper function with the model for prediction\n",
+ "- Stores the system prompt template for documentation\n",
+ "- Provides a unique `input_id` for tracking and identification\n",
+ "- Enables the agent to be used with ValidMind's testing and documentation features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "60a2ce7a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Initialize the agent as a model\n",
+ "vm_banking_model = vm.init_model(\n",
+ "    input_id=\"banking_agent_model\",\n",
+ "    predict_fn=banking_agent_fn,\n",
+ "    prompt=Prompt(template=system_context)\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "33ed446a",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Store the agent reference\n",
+ "\n",
+ "We'll also store a reference to the original banking agent object in the ValidMind model. This allows us to access the full agent functionality directly if needed, while still maintaining the wrapper function interface for ValidMind's testing framework.\n",
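+ "\n",
+ "As a quick sanity check before attaching the agent below, you can also call the wrapper directly using ValidMind's input format. A minimal sketch (the request and session values are hypothetical):\n",
+ "\n",
+ "```python\n",
+ "# Optional sanity check: invoke the wrapper with a hypothetical request\n",
+ "sample = {\n",
+ "    \"input\": \"Analyze fraud risk for a $500 ATM withdrawal in Miami\",\n",
+ "    \"session_id\": \"demo-session-2\",\n",
+ "}\n",
+ "response = banking_agent_fn(sample)\n",
+ "print(response[\"prediction\"][:300])\n",
+ "print(response.get(\"tool_called\"))\n",
+ "```"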
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2c653471",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Add the banking agent to the vm model\n",
+ "vm_banking_model.model = banking_agent"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bf44ea16",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Verify integration\n",
+ "\n",
+ "Let's confirm that the banking agent has been successfully integrated with ValidMind:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8e101b0f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(\"Banking Agent Successfully Integrated with ValidMind!\")\n",
+ "print(f\"Model ID: {vm_banking_model.input_id}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0c80518d",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Validate the system prompt\n",
+ "\n",
+ "Let's get an initial sense of how well our defined system prompt meets a few best practices for prompt engineering by running a few tests — we'll run evaluation tests on our agent's performance later.\n",
+ "\n",
+ "You run individual tests by calling [the `run_test` function](https://docs.validmind.ai/validmind/validmind/tests.html#run_test) provided by the `validmind.tests` module. Passing in our agentic model as an input, the tests below rate the prompt on a scale of 1-10 against the following criteria:\n",
+ "\n",
+ "- **[Clarity](https://docs.validmind.ai/tests/prompt_validation/Clarity.html)** — How clearly the prompt states the task.\n",
+ "- **[Conciseness](https://docs.validmind.ai/tests/prompt_validation/Conciseness.html)** — How succinctly the prompt states the task.\n",
+ "- **[Delimitation](https://docs.validmind.ai/tests/prompt_validation/Delimitation.html)** — When using complex prompts containing examples, contextual information, or other elements, is the prompt formatted in such a way that each element is clearly separated?\n",
+ "- **[NegativeInstruction](https://docs.validmind.ai/tests/prompt_validation/NegativeInstruction.html)** — Whether the prompt contains negative instructions.\n",
+ "- **[Specificity](https://docs.validmind.ai/tests/prompt_validation/Specificity.html)** — How specifically the prompt defines the task.\n",
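+ "\n",
+ "Each criterion below runs as its own cell. Equivalently, you could loop over the same test IDs, as in this compact sketch:\n",
+ "\n",
+ "```python\n",
+ "# Equivalent shorthand for the five individual cells that follow\n",
+ "for test_id in [\n",
+ "    \"validmind.prompt_validation.Clarity\",\n",
+ "    \"validmind.prompt_validation.Conciseness\",\n",
+ "    \"validmind.prompt_validation.Delimitation\",\n",
+ "    \"validmind.prompt_validation.NegativeInstruction\",\n",
+ "    \"validmind.prompt_validation.Specificity\",\n",
+ "]:\n",
+ "    vm.tests.run_test(test_id, inputs={\"model\": vm_banking_model}).log()\n",
+ "```"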
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f52dceb1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm.tests.run_test(\n",
+ "    \"validmind.prompt_validation.Clarity\",\n",
+ "    inputs={\n",
+ "        \"model\": vm_banking_model,\n",
+ "    },\n",
+ ").log()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "70d52333",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm.tests.run_test(\n",
+ "    \"validmind.prompt_validation.Conciseness\",\n",
+ "    inputs={\n",
+ "        \"model\": vm_banking_model,\n",
+ "    },\n",
+ ").log()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5aa89976",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm.tests.run_test(\n",
+ "    \"validmind.prompt_validation.Delimitation\",\n",
+ "    inputs={\n",
+ "        \"model\": vm_banking_model,\n",
+ "    },\n",
+ ").log()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8630197e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm.tests.run_test(\n",
+ "    \"validmind.prompt_validation.NegativeInstruction\",\n",
+ "    inputs={\n",
+ "        \"model\": vm_banking_model,\n",
+ "    },\n",
+ ").log()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bba99915",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm.tests.run_test(\n",
+ "    \"validmind.prompt_validation.Specificity\",\n",
+ "    inputs={\n",
+ "        \"model\": vm_banking_model,\n",
+ "    },\n",
+ ").log()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "af4d6d77",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "## Initialize the ValidMind datasets\n",
+ "\n",
+ "After validating our system prompt, let's import our sample dataset ([banking_test_dataset.py](banking_test_dataset.py)), which we'll use in the next section to evaluate our agent's performance across different banking scenarios:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0c70ca2c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from banking_test_dataset import banking_test_dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0268ce6e",
+ "metadata": {},
+ "source": [
+ "The next step is to connect your data with a ValidMind `Dataset` object. **This step is always necessary every time you want to connect a dataset to documentation and produce test results through ValidMind,** but you only need to do it once per dataset.\n",
+ "\n",
+ "Initialize a ValidMind dataset object using the [`init_dataset` function](https://docs.validmind.ai/validmind/validmind.html#init_dataset) from the ValidMind (`vm`) module. For this example, we'll pass in the following arguments:\n",
+ "\n",
+ "- **`input_id`** — A unique identifier that allows tracking what inputs are used when running each individual test.\n",
+ "- **`dataset`** — The raw dataset that you want to provide as input to tests.\n",
+ "- **`text_column`** — The name of the column containing the text input data.\n",
+ "- **`target_column`** — A required argument if tests require access to true values. This is the name of the target column in the dataset.\n",
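+ "\n",
+ "For orientation, the sample dataset is shaped roughly like the sketch below (hypothetical values, not the actual file contents; the column names match those described in the results table later in this notebook):\n",
+ "\n",
+ "```python\n",
+ "# Illustrative sketch of the dataset's shape (hypothetical example row)\n",
+ "import pandas as pd\n",
+ "\n",
+ "banking_test_dataset_example = pd.DataFrame([\n",
+ "    {\n",
+ "        \"input\": \"Analyze credit risk for a $50k personal loan\",\n",
+ "        \"expected_tools\": [\"credit_risk_analyzer\"],\n",
+ "        \"possible_outputs\": [\"credit risk\", \"risk analysis\"],\n",
+ "        \"session_id\": \"case-001\",\n",
+ "        \"category\": \"credit_risk\",\n",
+ "    },\n",
+ "])\n",
+ "```"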
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a7e9d158",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_test_dataset = vm.init_dataset(\n",
+ "    input_id=\"banking_test_dataset\",\n",
+ "    dataset=banking_test_dataset,\n",
+ "    text_column=\"input\",\n",
+ "    target_column=\"possible_outputs\",\n",
+ ")\n",
+ "\n",
+ "print(\"Banking Test Dataset Initialized in ValidMind!\")\n",
+ "print(f\"Dataset ID: {vm_test_dataset.input_id}\")\n",
+ "print(f\"Dataset columns: {vm_test_dataset._df.columns}\")\n",
+ "vm_test_dataset._df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9143fb6",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Assign predictions\n",
+ "\n",
+ "Now that both the model object and the dataset have been registered, we'll assign predictions to capture the banking agent's responses for evaluation:\n",
+ "\n",
+ "- The [`assign_predictions()` method](https://docs.validmind.ai/validmind/validmind/vm_models.html#assign_predictions) from the `Dataset` object can link existing predictions to any number of models.\n",
+ "- Here, it links the banking agent's prediction values to our `vm_test_dataset`.\n",
+ "\n",
+ "If no prediction values are passed, the method will compute predictions automatically:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1d462663",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_test_dataset.assign_predictions(vm_banking_model)\n",
+ "\n",
+ "print(\"Banking Agent Predictions Generated Successfully!\")\n",
+ "print(f\"Predictions assigned to {len(vm_test_dataset._df)} test cases\")\n",
+ "vm_test_dataset._df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8e50467e",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "## Running accuracy tests\n",
+ "\n",
+ "Using [`@vm.test`](https://docs.validmind.ai/validmind/validmind.html#test), let's implement some reusable custom *inline tests* to assess the accuracy of our banking agent:\n",
+ "\n",
+ "- An inline test refers to a test written and executed within the same environment as the code being tested — in this case, right in this Jupyter Notebook — without requiring a separate test file or framework.\n",
+ "- You'll note that the custom test functions are just regular Python functions that can include and require any Python library as you see fit."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d8a9b90",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Response accuracy test\n",
+ "\n",
+ "We'll create a custom test that evaluates the banking agent's ability to provide accurate responses by:\n",
+ "\n",
+ "- Testing against a dataset of predefined banking questions and expected answers.\n",
+ "- Checking if responses contain expected keywords and banking terminology.\n",
+ "- Providing detailed test results including pass/fail status.\n",
+ "- Helping identify any gaps in the agent's banking knowledge or response quality."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "90232066",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "\n",
+ "@vm.test(\"my_custom_tests.banking_accuracy_test\")\n",
+ "def banking_accuracy_test(model, dataset, list_of_columns):\n",
+ "    \"\"\"\n",
+ "    The Banking Accuracy Test evaluates whether the agent’s responses include\n",
+ "    critical domain-specific keywords and phrases that indicate accurate, compliant,\n",
+ "    and contextually appropriate banking information. This test ensures that the agent\n",
+ "    provides responses containing the expected banking terminology, risk classifications,\n",
+ "    account details, or other domain-relevant information required for regulatory compliance,\n",
+ "    customer safety, and operational accuracy.\n",
+ "    \"\"\"\n",
+ "    df = dataset._df\n",
+ "\n",
+ "    # Pre-compute responses for all tests\n",
+ "    y_true = dataset.y.tolist()\n",
+ "    y_pred = dataset.y_pred(model).tolist()\n",
+ "\n",
+ "    # Keyword-match result per row\n",
+ "    test_results = []\n",
+ "    for response, keywords in zip(y_pred, y_true):\n",
+ "        # Convert keywords to list if not already a list\n",
+ "        if not isinstance(keywords, list):\n",
+ "            keywords = [keywords]\n",
+ "        test_results.append(any(str(keyword).lower() in str(response).lower() for keyword in keywords))\n",
+ "\n",
+ "    results = pd.DataFrame()\n",
+ "    column_names = [col + \"_details\" for col in list_of_columns]\n",
+ "    results[column_names] = df[list_of_columns]\n",
+ "    results[\"actual\"] = y_pred\n",
+ "    results[\"expected\"] = y_true\n",
+ "    results[\"passed\"] = test_results\n",
+ "    # Per-row error message for failed cases\n",
+ "    results[\"error\"] = [\n",
+ "        None if passed else f'Response did not contain any expected keywords: {expected}'\n",
+ "        for passed, expected in zip(test_results, y_true)\n",
+ "    ]\n",
+ "\n",
+ "    return results"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7eed5265",
+ "metadata": {},
+ "source": [
+ "Now that we've defined our custom response accuracy test, we can run it with the same `run_test()` function we used earlier to validate the system prompt, passing our sample dataset and agentic model as inputs, and log the test results to the ValidMind Platform with the [`log()` method](https://docs.validmind.ai/validmind/validmind/vm_models.html#log):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e68884d5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "result = vm.tests.run_test(\n",
+ "    \"my_custom_tests.banking_accuracy_test\",\n",
+ "    inputs={\n",
+ "        \"dataset\": vm_test_dataset,\n",
+ "        \"model\": vm_banking_model\n",
+ "    },\n",
+ "    params={\n",
+ "        \"list_of_columns\": [\"input\"]\n",
+ "    }\n",
+ ")\n",
+ "result.log()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4d758ddf",
+ "metadata": {},
+ "source": [
+ "Let's review the first five rows of the test dataset to see how well the banking agent performed. Each column in the output serves a specific purpose in evaluating agent performance:\n",
+ "\n",
+ "| Column header | Description | Importance |\n",
+ "|--------------|-------------|------------|\n",
+ "| **`input`** | Original user query or request | Essential for understanding the context of each test case and tracing which inputs led to specific agent behaviors. |\n",
+ "| **`expected_tools`** | Banking tools that should be invoked for this request | Enables validation of correct tool selection, which is critical for agentic AI systems where choosing the right tool is a key success metric. |\n",
+ "| **`expected_output`** | Expected output or keywords that should appear in the response | Defines the success criteria for each test case, enabling objective evaluation of whether the agent produced the correct result. |\n",
+ "| **`session_id`** | Unique identifier for each test session | Allows tracking and correlation of related test runs, debugging specific sessions, and maintaining audit trails. |\n",
+ "| **`category`** | Classification of the request type | Helps organize test results by domain and identify performance patterns across different banking use cases. 
|\n", + "| **`banking_agent_model_output`** | Complete agent response including all messages and reasoning | Allows you to examine the full output to assess response quality, completeness, and correctness beyond just keyword matching. |\n", + "| **`banking_agent_model_tool_messages`** | Messages exchanged with the banking tools | Critical for understanding how the agent interacted with tools, what parameters were passed, and what tool outputs were received. |\n", + "| **`banking_agent_model_tool_called`** | Specific tool that was invoked | Enables validation that the agent selected the correct tool for each request, which is fundamental to agentic AI validation. |\n", + "| **`possible_outputs`** | Alternative valid outputs or keywords that could appear in the response | Provides flexibility in evaluation by accounting for multiple acceptable response formats or variations. |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "78f7edb1", + "metadata": {}, + "outputs": [], + "source": [ + "vm_test_dataset.df.head(5)" + ] + }, + { + "cell_type": "markdown", + "id": "6f233bef", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Tool selection accuracy test\n", + "\n", + "We'll also create a custom test that evaluates the banking agent's ability to select the correct tools for different requests by:\n", + "\n", + "- Testing against a dataset of predefined banking queries with expected tool selections.\n", + "- Comparing the tools actually invoked by the agent against the expected tools for each request.\n", + "- Providing quantitative accuracy scores that measure the proportion of expected tools correctly selected.\n", + "- Helping identify gaps in the agent's understanding of user needs and tool selection logic." + ] + }, + { + "cell_type": "markdown", + "id": "d0b46111", + "metadata": {}, + "source": [ + "First, we'll define a helper function that extracts tool calls from the agent's messages and compares them against the expected tools. 
This function handles different message formats (dictionary or object) and calculates accuracy scores:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e68798be", + "metadata": {}, + "outputs": [], + "source": [ + "def validate_tool_calls_simple(messages, expected_tools):\n", + " \"\"\"Simple validation of tool calls without RAGAS dependency issues.\"\"\"\n", + " \n", + " tool_calls_found = []\n", + " \n", + " for message in messages:\n", + " if hasattr(message, 'tool_calls') and message.tool_calls:\n", + " for tool_call in message.tool_calls:\n", + " # Handle both dictionary and object formats\n", + " if isinstance(tool_call, dict):\n", + " tool_calls_found.append(tool_call['name'])\n", + " else:\n", + " # ToolCall object - use attribute access\n", + " tool_calls_found.append(tool_call.name)\n", + " \n", + " # Check if expected tools were called\n", + " accuracy = 0.0\n", + " matches = 0\n", + " if expected_tools:\n", + " matches = sum(1 for tool in expected_tools if tool in tool_calls_found)\n", + " accuracy = matches / len(expected_tools)\n", + " \n", + " return {\n", + " 'expected_tools': expected_tools,\n", + " 'found_tools': tool_calls_found,\n", + " 'matches': matches,\n", + " 'total_expected': len(expected_tools) if expected_tools else 0,\n", + " 'accuracy': accuracy,\n", + " }" + ] + }, + { + "cell_type": "markdown", + "id": "1b45472c", + "metadata": {}, + "source": [ + "Now we'll define the main test function that uses the helper function to evaluate tool selection accuracy across all test cases in the dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "604d7313", + "metadata": {}, + "outputs": [], + "source": [ + "@vm.test(\"my_custom_tests.BankingToolCallAccuracy\")\n", + "def BankingToolCallAccuracy(dataset, agent_output_column, expected_tools_column):\n", + " \"\"\"\n", + " Evaluates the tool selection accuracy of a LangGraph-powered banking agent.\n", + "\n", + " This test measures whether the agent correctly identifies and invokes the required banking tools\n", + " for each user query scenario.\n", + " For each case, the outputs generated by the agent (including its tool calls) are compared against an\n", + " expected set of tools. The test considers both coverage and exactness: it computes the proportion of\n", + " expected tools correctly called by the agent for each instance.\n", + "\n", + " Parameters:\n", + " dataset (VMDataset): The dataset containing user queries, agent outputs, and ground-truth tool expectations.\n", + " agent_output_column (str): Dataset column name containing agent outputs (should include tool call details in 'messages').\n", + " expected_tools_column (str): Dataset column specifying the true expected tools (as lists).\n", + "\n", + " Returns:\n", + " List[dict]: Per-row dictionaries with details: expected tools, found tools, match count, total expected, and accuracy score.\n", + "\n", + " Purpose:\n", + " Provides diagnostic evidence of the banking agent's core reasoning ability—specifically, its capacity to\n", + " interpret user needs and select the correct banking actions. Useful for diagnosing gaps in tool coverage,\n", + " misclassifications, or breakdowns in agent logic.\n", + "\n", + " Interpretation:\n", + " - An accuracy of 1.0 signals perfect tool selection for that example.\n", + " - Lower scores may indicate partial or complete failures to invoke required tools.\n", + " - Review 'found_tools' vs. 
'expected_tools' to understand the source of discrepancies.\n", + "\n", + " Strengths:\n", + " - Directly tests a core capability of compositional tool-use agents.\n", + " - Framework-agnostic; robust to tool call output format (object or dict).\n", + " - Supports batch validation and result logging for systematic documentation.\n", + "\n", + " Limitations:\n", + " - Does not penalize extra, unnecessary tool calls.\n", + " - Does not assess result quality—only correct invocation.\n", + "\n", + " \"\"\"\n", + " df = dataset._df\n", + " \n", + " results = []\n", + " for i, row in df.iterrows():\n", + " result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])\n", + " results.append(result)\n", + " \n", + " return results" + ] + }, + { + "cell_type": "markdown", + "id": "d594c973", + "metadata": {}, + "source": [ + "Finally, we can call our function with `run_test()` and log the test results to the ValidMind Platform:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dd14115e", + "metadata": {}, + "outputs": [], + "source": [ + "result = vm.tests.run_test(\n", + " \"my_custom_tests.BankingToolCallAccuracy\",\n", + " inputs={\n", + " \"dataset\": vm_test_dataset,\n", + " },\n", + " params={\n", + " \"agent_output_column\": \"banking_agent_model_output\",\n", + " \"expected_tools_column\": \"expected_tools\"\n", + " }\n", + ")\n", + "result.log()" + ] + }, + { + "cell_type": "markdown", + "id": "f78f4107", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Assigning AI evaluation metric scores\n", + "\n", + "*AI agent evaluation metrics* are specialized measurements designed to assess how well autonomous LLM-based agents reason, plan, select and execute tools, and ultimately complete user tasks by analyzing the *full execution trace* — including reasoning steps, tool calls, intermediate decisions, and outcomes, rather than just single input–output pairs. These metrics are essential because agent failures often occur in ways traditional LLM metrics miss — for example, choosing the right tool with wrong arguments, creating a good plan but not following it, or completing a task inefficiently.\n", + "\n", + "In this section, we'll evaluate our banking agent's outputs and add scoring to our sample dataset against metrics defined in [DeepEval’s AI agent evaluation framework](https://deepeval.com/guides/guides-ai-agent-evaluation-metrics) which breaks down AI agent evaluation into three layers with corresponding subcategories: **reasoning**, **action**, and **execution**.\n", + "\n", + "Together, these three metrics enable granular diagnosis of agent behavior, help pinpoint where failures occur (reasoning, action, or execution), and support both development benchmarking and production monitoring." 
+ ] + }, + { + "cell_type": "markdown", + "id": "3a9c853a", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Identify relevant DeepEval scorers\n", + "\n", + "*Scorers* are evaluation metrics that analyze model outputs and store their results in the dataset:\n", + "\n", + "- Each scorer adds a new column to the dataset with format: `{scorer_name}_{metric_name}`\n", + "- The column contains the numeric score (typically `0`-`1`) for each example\n", + "- Multiple scorers can be run on the same dataset, each adding their own column\n", + "- Scores are persisted in the dataset for later analysis and visualization\n", + "- Common scorer patterns include:\n", + " - Model performance metrics (accuracy, F1, etc.)\n", + " - Output quality metrics (relevance, faithfulness)\n", + " - Task-specific metrics (completion, correctness)\n", + "\n", + "Use `list_scorers()` from [`validmind.scorers`](https://docs.validmind.ai/validmind/validmind/tests.html#scorer) to discover all available scoring methods and their IDs that can be used with `assign_scores()`. We'll filter these results to return only DeepEval scorers for our desired three metrics in a formatted table with descriptions:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "730c70ec", + "metadata": {}, + "outputs": [], + "source": [ + "# Load all DeepEval scorers\n", + "llm_scorers_dict = vm.tests.load._load_tests([s for s in vm.scorer.list_scorers() if \"deepeval\" in s.lower()])\n", + "\n", + "# Categorize scorers by metric layer\n", + "reasoning_scorers = {}\n", + "action_scorers = {}\n", + "execution_scorers = {}\n", + "\n", + "for scorer_id, scorer_func in llm_scorers_dict.items():\n", + " tags = getattr(scorer_func, \"__tags__\", [])\n", + " scorer_name = scorer_id.split(\".\")[-1]\n", + "\n", + " if \"reasoning_layer\" in tags:\n", + " reasoning_scorers[scorer_id] = scorer_func\n", + " elif \"action_layer\" in tags:\n", + " action_scorers[scorer_id] = scorer_func\n", + " elif \"TaskCompletion\" in scorer_name:\n", + " execution_scorers[scorer_id] = scorer_func\n", + "\n", + "# Display scorers by category\n", + "print(\"=\" * 80)\n", + "print(\"REASONING LAYER\")\n", + "print(\"=\" * 80)\n", + "if reasoning_scorers:\n", + " reasoning_df = vm.tests.load._pretty_list_tests(reasoning_scorers, truncate=True)\n", + " display(reasoning_df)\n", + "else:\n", + " print(\"No reasoning layer scorers found.\")\n", + "\n", + "print(\"\\n\" + \"=\" * 80)\n", + "print(\"ACTION LAYER\")\n", + "print(\"=\" * 80)\n", + "if action_scorers:\n", + " action_df = vm.tests.load._pretty_list_tests(action_scorers, truncate=True)\n", + " display(action_df)\n", + "else:\n", + " print(\"No action layer scorers found.\")\n", + "\n", + "print(\"\\n\" + \"=\" * 80)\n", + "print(\"EXECUTION LAYER\")\n", + "print(\"=\" * 80)\n", + "if execution_scorers:\n", + " execution_df = vm.tests.load._pretty_list_tests(execution_scorers, truncate=True)\n", + " display(execution_df)\n", + "else:\n", + " print(\"No execution layer scorers found.\")" + ] + }, + { + "cell_type": "markdown", + "id": "4dd73d0d", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Assign reasoning scores\n", + "\n", + "*Reasoning* evaluates planning and strategy generation:\n", + "\n", + "- **Plan quality** – How logical, complete, and efficient the agent’s plan is.\n", + "- **Plan adherence** – Whether the agent follows its own plan during execution." 
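+ "\n",
+ "Each scorer run below appends a `{input_id}_{MetricName}_score` and a matching `_reason` column for our model. Once the two cells in this section have run, you can inspect the new columns together, for example:\n",
+ "\n",
+ "```python\n",
+ "# After running the two scorers below, inspect the plan-related columns side by side\n",
+ "plan_cols = [c for c in vm_test_dataset._df.columns if \"Plan\" in c]\n",
+ "vm_test_dataset._df[plan_cols].head()\n",
+ "```"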
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "06ccae28",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Plan quality score\n",
+ "\n",
+ "Let's measure how well our banking agent generates a plan before acting. A high score means the plan is logical, complete, and efficient."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "52f362ba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_test_dataset.assign_scores(\n",
+ "    metrics=\"validmind.scorers.llm.deepeval.PlanQuality\",\n",
+ "    model=vm_banking_model,\n",
+ "    input_column=\"input\",\n",
+ ")\n",
+ "vm_test_dataset._df[['banking_agent_model_PlanQuality_score', 'banking_agent_model_PlanQuality_reason']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8dcdc88f",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Plan adherence score\n",
+ "\n",
+ "Let's check whether our banking agent follows the plan it created. Deviations lower this score and indicate gaps between reasoning and execution."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4124a7c2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_test_dataset.assign_scores(\n",
+ "    metrics=\"validmind.scorers.llm.deepeval.PlanAdherence\",\n",
+ "    input_column=\"input\",\n",
+ "    model=vm_banking_model,\n",
+ ")\n",
+ "vm_test_dataset._df[['banking_agent_model_PlanAdherence_score', 'banking_agent_model_PlanAdherence_reason']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6da1ac95",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Assign action scores\n",
+ "\n",
+ "*Action* assesses tool usage and argument generation:\n",
+ "\n",
+ "- **Tool correctness** – Whether the agent selects and calls the right tools.\n",
+ "- **Argument correctness** – Whether the agent generates correct tool arguments."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d4db8270",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Tool correctness score\n",
+ "\n",
+ "Let's evaluate whether our banking agent selects the appropriate tool for the task. Choosing the wrong tool reduces performance even if the reasoning was correct."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8d2e8a25",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_test_dataset.assign_scores(\n",
+ "    metrics=\"validmind.scorers.llm.deepeval.ToolCorrectness\",\n",
+ "    input_column=\"input\",\n",
+ "    model=vm_banking_model,\n",
+ "    expected_tools_called_column=\"expected_tools\",\n",
+ "    actual_tools_called_column=\"banking_agent_model_tool_called\",\n",
+ ")\n",
+ "vm_test_dataset._df[['banking_agent_model_ToolCorrectness_score', 'banking_agent_model_ToolCorrectness_reason']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9aa50b05",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Argument correctness score\n",
+ "\n",
+ "Let's assess whether our banking agent provides correct inputs or arguments to the selected tool. Incorrect arguments can lead to failed or unexpected results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "04f90489",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_test_dataset.assign_scores(\n",
+ "    metrics=\"validmind.scorers.llm.deepeval.ArgumentCorrectness\",\n",
+ "    input_column=\"input\",\n",
+ "    model=vm_banking_model,\n",
+ "    actual_tools_called_column=\"banking_agent_model_tool_called\",\n",
+ ")\n",
+ "vm_test_dataset._df[['banking_agent_model_ArgumentCorrectness_score', 'banking_agent_model_ArgumentCorrectness_reason']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c59e5595",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Assign execution score\n",
+ "\n",
+ "*Execution* measures end-to-end performance:\n",
+ "\n",
+ "- **Task completion** – Whether the agent successfully completes the intended task.\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d64600ca",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Task completion score\n",
+ "\n",
+ "Let's evaluate whether our banking agent successfully completes the requested tasks. Incomplete task execution can lead to user dissatisfaction and failed banking operations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "05024f1f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm_test_dataset.assign_scores(\n",
+ "    metrics=\"validmind.scorers.llm.deepeval.TaskCompletion\",\n",
+ "    input_column=\"input\",\n",
+ "    model=vm_banking_model,\n",
+ "    actual_tools_called_column=\"banking_agent_model_tool_called\",\n",
+ ")\n",
+ "vm_test_dataset._df[['banking_agent_model_TaskCompletion_score', 'banking_agent_model_TaskCompletion_reason']]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "21aa9b0d",
+ "metadata": {},
+ "source": [
+ "As you recall from the beginning of this section, when we run scorers through `assign_scores()`, the return values are automatically processed and added as new columns with the format `{scorer_name}_{metric_name}`. Note that the task completion scorer has added a new `banking_agent_model_TaskCompletion_score` column to our dataset.\n",
+ "\n",
+ "We'll use this column to visualize the distribution of task completion scores across our test cases through the [BoxPlot test](https://docs.validmind.ai/validmind/validmind/tests/plots/BoxPlot.html#boxplot):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7f6d08ca",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm.tests.run_test(\n",
+ "    \"validmind.plots.BoxPlot\",\n",
+ "    inputs={\"dataset\": vm_test_dataset},\n",
+ "    params={\n",
+ "        \"columns\": \"banking_agent_model_TaskCompletion_score\",\n",
+ "        \"title\": \"Distribution of Task Completion Scores\",\n",
+ "        \"ylabel\": \"Score\",\n",
+ "        \"figsize\": (8, 6)\n",
+ "    }\n",
+ ").log()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "012bbcb8",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "## Running RAGAS tests\n",
+ "\n",
+ "Next, let's run some out-of-the-box *Retrieval-Augmented Generation Assessment* (RAGAS) tests available in the ValidMind Library. RAGAS provides specialized metrics for evaluating retrieval-augmented generation systems and conversational AI agents. These metrics analyze different aspects of agent performance by assessing how well systems integrate retrieved information with generated responses.\n",
+ "\n",
+ "Our banking agent uses tools to retrieve information and generates responses based on that context, making it similar to a RAG system. RAGAS metrics help evaluate the quality of this integration by analyzing the relationship between retrieved tool outputs, user queries, and generated responses.\n",
+ "\n",
+ "These tests provide insights into how well our banking agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to banking users while maintaining fidelity to retrieved information."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2036afba",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "### Identify relevant RAGAS tests\n",
+ "\n",
+ "Let's explore some of ValidMind's available tests. Using ValidMind's repository of tests streamlines your development testing and helps you ensure that your models are documented and evaluated appropriately.\n",
+ "\n",
+ "You can pass `task` and `tags` as parameters to the [`vm.tests.list_tests()` function](https://docs.validmind.ai/validmind/validmind/tests.html#list_tests) to filter the tests based on the tags and task types:\n",
+ "\n",
+ "- **`task`** represents the kind of modeling task associated with a test. Here we'll focus on `text_qa` tasks.\n",
+ "- **`tags`** are free-form descriptions providing more details about the test, for example, what category the test falls into. Here we'll focus on the `ragas` tag.\n",
+ "\n",
+ "We'll then run three of the returned tests as examples below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0701f5a9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm.tests.list_tests(task=\"text_qa\", tags=[\"ragas\"])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1741ffc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Faithfulness\n",
+ "\n",
+ "Let's evaluate whether the banking agent's responses accurately reflect the information retrieved from tools. Unfaithful responses can misreport credit analysis, financial calculations, and compliance results — undermining user trust in the banking agent."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "92044533",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "vm.tests.run_test(\n",
+ "    \"validmind.model_validation.ragas.Faithfulness\",\n",
+ "    inputs={\"dataset\": vm_test_dataset},\n",
+ "    param_grid={\n",
+ "        \"user_input_column\": [\"input\"],\n",
+ "        \"response_column\": [\"banking_agent_model_prediction\"],\n",
+ "        \"retrieved_contexts_column\": [\"banking_agent_model_tool_messages\"],\n",
+ "    },\n",
+ ").log()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "42b71ccc",
+ "metadata": {},
+ "source": [
+ "\n",
+ "\n",
+ "#### Response Relevancy\n",
+ "\n",
+ "Let's evaluate whether the banking agent's answers address the user's original question or request. Irrelevant or off-topic responses can frustrate users and fail to deliver the banking information they need."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d7483bc3", + "metadata": {}, + "outputs": [], + "source": [ + "vm.tests.run_test(\n", + " \"validmind.model_validation.ragas.ResponseRelevancy\",\n", + " inputs={\"dataset\": vm_test_dataset},\n", + " params={\n", + " \"user_input_column\": \"input\",\n", + " \"response_column\": \"banking_agent_model_prediction\",\n", + " \"retrieved_contexts_column\": \"banking_agent_model_tool_messages\",\n", + " }\n", + ").log()" + ] + }, + { + "cell_type": "markdown", + "id": "4f4d0569", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Context Recall\n", + "\n", + "Let's evaluate how well the banking agent uses the information retrieved from tools when generating its responses. Poor context recall can lead to incomplete or underinformed answers even when the right tools were selected." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5dc00ce", + "metadata": {}, + "outputs": [], + "source": [ + "vm.tests.run_test(\n", + " \"validmind.model_validation.ragas.ContextRecall\",\n", + " inputs={\"dataset\": vm_test_dataset},\n", + " param_grid={\n", + " \"user_input_column\": [\"input\"],\n", + " \"retrieved_contexts_column\": [\"banking_agent_model_tool_messages\"],\n", + " \"reference_column\": [\"banking_agent_model_prediction\"],\n", + " },\n", + ").log()" + ] + }, + { + "cell_type": "markdown", + "id": "b987b00e", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Running safety tests\n", + "\n", + "Finally, let's run some out-of-the-box *safety* tests available in the ValidMind Library. Safety tests provide specialized metrics for evaluating whether AI agents operate reliably and securely. These metrics analyze different aspects of agent behavior by assessing adherence to safety guidelines, consistency of outputs, and resistance to harmful or inappropriate requests.\n", + "\n", + "Our banking agent handles sensitive financial information and user requests, making safety and reliability essential. Safety tests help evaluate whether the agent maintains appropriate boundaries, responds consistently and correctly to inputs, and avoids generating harmful, biased, or unprofessional content.\n", + "\n", + "These tests provide insights into how well our banking agent upholds standards of fairness and professionalism, ensuring it operates reliably and securely for banking users." + ] + }, + { + "cell_type": "markdown", + "id": "a754cca3", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### AspectCritic\n", + "\n", + "Let's evaluate our banking agent's responses across multiple quality dimensions — conciseness, coherence, correctness, harmfulness, and maliciousness. Weak performance on these dimensions can degrade user experience, fall short of professional banking standards, or introduce safety risks. 
\n", + "\n", + "We'll use the `AspectCritic` we identified earlier:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "148daa2b", + "metadata": {}, + "outputs": [], + "source": [ + "vm.tests.run_test(\n", + " \"validmind.model_validation.ragas.AspectCritic\",\n", + " inputs={\"dataset\": vm_test_dataset},\n", + " param_grid={\n", + " \"user_input_column\": [\"input\"],\n", + " \"response_column\": [\"banking_agent_model_prediction\"],\n", + " \"retrieved_contexts_column\": [\"banking_agent_model_tool_messages\"],\n", + " },\n", + ").log()" + ] + }, + { + "cell_type": "markdown", + "id": "92e5b1f6", + "metadata": {}, + "source": [ + "\n", + "\n", + "#### Bias\n", + "\n", + "Let's evaluate whether our banking agent's prompts contain unintended biases that could affect banking decisions. Biased prompts can lead to unfair or discriminatory outcomes — undermining customer trust and exposing the institution to compliance risk.\n", + "\n", + "We'll first use `list_tests()` again to filter for tests relating to `prompt_validation`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74eba86c", + "metadata": {}, + "outputs": [], + "source": [ + "vm.tests.list_tests(filter=\"prompt_validation\")" + ] + }, + { + "cell_type": "markdown", + "id": "bcc66b65", + "metadata": {}, + "source": [ + "And then run the identified `Bias` test:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "062cf8e7", + "metadata": {}, + "outputs": [], + "source": [ + "vm.tests.run_test(\n", + " \"validmind.prompt_validation.Bias\",\n", + " inputs={\n", + " \"model\": vm_banking_model,\n", + " },\n", + ").log()" + ] + }, + { + "cell_type": "markdown", + "id": "a2832750", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Next steps\n", + "\n", + "You can look at the output produced by the ValidMind Library right in the notebook where you ran the code, as you would expect. But there is a better way — use the ValidMind Platform to work with your model documentation." + ] + }, + { + "cell_type": "markdown", + "id": "a8cb1a58", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Work with your model documentation\n", + "\n", + "1. From the **Inventory** in the ValidMind Platform, go to the model you registered earlier. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/working-with-model-inventory.html))\n", + "\n", + "2. In the left sidebar that appears for your model, click **Documentation** under Documents.\n", + "\n", + " What you see is the full draft of your model documentation in a more easily consumable version. From here, you can make qualitative edits to model documentation, view guidelines, collaborate with validators, and submit your model documentation for approval when it's ready. [Learn more ...](https://docs.validmind.ai/guide/working-with-model-documentation.html)\n", + "\n", + "3. Click into any section related to the tests we ran in this notebook, for example: **4.3. Prompt Evaluation** to review the results of the tests we logged." + ] + }, + { + "cell_type": "markdown", + "id": "94ef26be", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Customize the banking agent for your use case\n", + "\n", + "You've now built an agentic AI system designed for banking use cases that supports compliance with supervisory guidance such as SR 11-7 and SS1/23, covering credit and fraud risk assessment for both retail and commercial banking. 
Extend this example agent to real-world banking scenarios and production deployment by:\n", + "\n", + "- Adapting the banking tools to your organization's specific requirements\n", + "- Adding more banking scenarios and edge cases to your test set\n", + "- Connecting the agent to your banking systems and databases\n", + "- Implementing additional banking-specific tools and workflows" + ] + }, + { + "cell_type": "markdown", + "id": "a681e49c", + "metadata": {}, + "source": [ + "\n", + "\n", + "### Discover more learning resources\n", + "\n", + "Learn more about the ValidMind Library tools we used in this notebook:\n", + "\n", + "- [Custom prompts](https://docs.validmind.ai/notebooks/how_to/customize_test_result_descriptions.html)\n", + "- [Custom tests](https://docs.validmind.ai/notebooks/code_samples/custom_tests/implement_custom_tests.html)\n", + "- [ValidMind scorers](https://docs.validmind.ai/notebooks/how_to/assign_scores_complete_tutorial.html)\n", + "\n", + "We also offer many more interactive notebooks to help you document models:\n", + "\n", + "- [Run tests & test suites](https://docs.validmind.ai/developer/how-to/testing-overview.html)\n", + "- [Use ValidMind Library features](https://docs.validmind.ai/developer/how-to/feature-overview.html)\n", + "- [Code samples by use case](https://docs.validmind.ai/guide/samples-jupyter-notebooks.html)\n", + "\n", + "Or, visit our [documentation](https://docs.validmind.ai/) to learn more about ValidMind." + ] + }, + { + "cell_type": "markdown", + "id": "707c1b6e", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Upgrade ValidMind\n", + "\n", + "
After installing ValidMind, you’ll want to periodically make sure you are on the latest version to access any new features and other enhancements.
\n", + "\n", + "Retrieve the information for the currently installed version of ValidMind:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9733adff", + "metadata": {}, + "outputs": [], + "source": [ + "%pip show validmind" + ] + }, + { + "cell_type": "markdown", + "id": "e4b0b646", + "metadata": {}, + "source": [ + "If the version returned is lower than the version indicated in our [production open-source code](https://github.com/validmind/validmind-library/blob/prod/validmind/__version__.py), restart your notebook and run:\n", + "\n", + "```bash\n", + "%pip install --upgrade validmind\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "387fa7f1", + "metadata": {}, + "source": [ + "You may need to restart your kernel after running the upgrade package for changes to be applied." + ] + }, + { + "cell_type": "markdown", + "id": "copyright-de4baf0f42ba4a37946d52586dff1049", + "metadata": {}, + "source": [ + "\n", + "\n", + "\n", + "\n", + "***\n", + "\n", + "Copyright © 2023-2026 ValidMind Inc. All rights reserved.
\n", + "Refer to [LICENSE](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.
\n", + "SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "validmind-1QuffXMV-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 } diff --git a/poetry.lock b/poetry.lock index e7ee87613..dbc58f63c 100644 --- a/poetry.lock +++ b/poetry.lock @@ -1,4 +1,4 @@ -# This file is automatically @generated by Poetry 2.1.3 and should not be changed by hand. +# This file is automatically @generated by Poetry 2.1.2 and should not be changed by hand. [[package]] name = "aiodns" @@ -714,10 +714,6 @@ files = [ {file = "Brotli-1.1.0-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:a37b8f0391212d29b3a91a799c8e4a2855e0576911cdfb2515487e30e322253d"}, {file = "Brotli-1.1.0-cp310-cp310-musllinux_1_1_ppc64le.whl", hash = "sha256:e84799f09591700a4154154cab9787452925578841a94321d5ee8fb9a9a328f0"}, {file = "Brotli-1.1.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:f66b5337fa213f1da0d9000bc8dc0cb5b896b726eefd9c6046f699b169c41b9e"}, - {file = "Brotli-1.1.0-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:5dab0844f2cf82be357a0eb11a9087f70c5430b2c241493fc122bb6f2bb0917c"}, - {file = "Brotli-1.1.0-cp310-cp310-musllinux_1_2_i686.whl", hash = "sha256:e4fe605b917c70283db7dfe5ada75e04561479075761a0b3866c081d035b01c1"}, - {file = "Brotli-1.1.0-cp310-cp310-musllinux_1_2_ppc64le.whl", hash = "sha256:1e9a65b5736232e7a7f91ff3d02277f11d339bf34099a56cdab6a8b3410a02b2"}, - {file = "Brotli-1.1.0-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:58d4b711689366d4a03ac7957ab8c28890415e267f9b6589969e74b6e42225ec"}, {file = "Brotli-1.1.0-cp310-cp310-win32.whl", hash = "sha256:be36e3d172dc816333f33520154d708a2657ea63762ec16b62ece02ab5e4daf2"}, {file = "Brotli-1.1.0-cp310-cp310-win_amd64.whl", hash = "sha256:0c6244521dda65ea562d5a69b9a26120769b7a9fb3db2fe9545935ed6735b128"}, {file = "Brotli-1.1.0-cp311-cp311-macosx_10_9_universal2.whl", hash = "sha256:a3daabb76a78f829cafc365531c972016e4aa8d5b4bf60660ad8ecee19df7ccc"}, @@ -730,14 +726,8 @@ files = [ {file = "Brotli-1.1.0-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:19c116e796420b0cee3da1ccec3b764ed2952ccfcc298b55a10e5610ad7885f9"}, {file = "Brotli-1.1.0-cp311-cp311-musllinux_1_1_ppc64le.whl", hash = "sha256:510b5b1bfbe20e1a7b3baf5fed9e9451873559a976c1a78eebaa3b86c57b4265"}, {file = "Brotli-1.1.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:a1fd8a29719ccce974d523580987b7f8229aeace506952fa9ce1d53a033873c8"}, - {file = "Brotli-1.1.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:c247dd99d39e0338a604f8c2b3bc7061d5c2e9e2ac7ba9cc1be5a69cb6cd832f"}, - {file = "Brotli-1.1.0-cp311-cp311-musllinux_1_2_i686.whl", hash = "sha256:1b2c248cd517c222d89e74669a4adfa5577e06ab68771a529060cf5a156e9757"}, - {file = "Brotli-1.1.0-cp311-cp311-musllinux_1_2_ppc64le.whl", hash = "sha256:2a24c50840d89ded6c9a8fdc7b6ed3692ed4e86f1c4a4a938e1e92def92933e0"}, - {file = "Brotli-1.1.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:f31859074d57b4639318523d6ffdca586ace54271a73ad23ad021acd807eb14b"}, {file = "Brotli-1.1.0-cp311-cp311-win32.whl", hash = "sha256:39da8adedf6942d76dc3e46653e52df937a3c4d6d18fdc94a7c29d263b1f5b50"}, {file = "Brotli-1.1.0-cp311-cp311-win_amd64.whl", hash = "sha256:aac0411d20e345dc0920bdec5548e438e999ff68d77564d5e9463a7ca9d3e7b1"}, - {file = 
"Brotli-1.1.0-cp312-cp312-macosx_10_13_universal2.whl", hash = "sha256:32d95b80260d79926f5fab3c41701dbb818fde1c9da590e77e571eefd14abe28"}, - {file = "Brotli-1.1.0-cp312-cp312-macosx_10_13_x86_64.whl", hash = "sha256:b760c65308ff1e462f65d69c12e4ae085cff3b332d894637f6273a12a482d09f"}, {file = "Brotli-1.1.0-cp312-cp312-macosx_10_9_universal2.whl", hash = "sha256:316cc9b17edf613ac76b1f1f305d2a748f1b976b033b049a6ecdfd5612c70409"}, {file = "Brotli-1.1.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:caf9ee9a5775f3111642d33b86237b05808dafcd6268faa492250e9b78046eb2"}, {file = "Brotli-1.1.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:70051525001750221daa10907c77830bc889cb6d865cc0b813d9db7fefc21451"}, @@ -748,24 +738,8 @@ files = [ {file = "Brotli-1.1.0-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:4093c631e96fdd49e0377a9c167bfd75b6d0bad2ace734c6eb20b348bc3ea180"}, {file = "Brotli-1.1.0-cp312-cp312-musllinux_1_1_ppc64le.whl", hash = "sha256:7e4c4629ddad63006efa0ef968c8e4751c5868ff0b1c5c40f76524e894c50248"}, {file = "Brotli-1.1.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:861bf317735688269936f755fa136a99d1ed526883859f86e41a5d43c61d8966"}, - {file = "Brotli-1.1.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:87a3044c3a35055527ac75e419dfa9f4f3667a1e887ee80360589eb8c90aabb9"}, - {file = "Brotli-1.1.0-cp312-cp312-musllinux_1_2_i686.whl", hash = "sha256:c5529b34c1c9d937168297f2c1fde7ebe9ebdd5e121297ff9c043bdb2ae3d6fb"}, - {file = "Brotli-1.1.0-cp312-cp312-musllinux_1_2_ppc64le.whl", hash = "sha256:ca63e1890ede90b2e4454f9a65135a4d387a4585ff8282bb72964fab893f2111"}, - {file = "Brotli-1.1.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:e79e6520141d792237c70bcd7a3b122d00f2613769ae0cb61c52e89fd3443839"}, {file = "Brotli-1.1.0-cp312-cp312-win32.whl", hash = "sha256:5f4d5ea15c9382135076d2fb28dde923352fe02951e66935a9efaac8f10e81b0"}, {file = "Brotli-1.1.0-cp312-cp312-win_amd64.whl", hash = "sha256:906bc3a79de8c4ae5b86d3d75a8b77e44404b0f4261714306e3ad248d8ab0951"}, - {file = "Brotli-1.1.0-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:8bf32b98b75c13ec7cf774164172683d6e7891088f6316e54425fde1efc276d5"}, - {file = "Brotli-1.1.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:7bc37c4d6b87fb1017ea28c9508b36bbcb0c3d18b4260fcdf08b200c74a6aee8"}, - {file = "Brotli-1.1.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3c0ef38c7a7014ffac184db9e04debe495d317cc9c6fb10071f7fefd93100a4f"}, - {file = "Brotli-1.1.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:91d7cc2a76b5567591d12c01f019dd7afce6ba8cba6571187e21e2fc418ae648"}, - {file = "Brotli-1.1.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a93dde851926f4f2678e704fadeb39e16c35d8baebd5252c9fd94ce8ce68c4a0"}, - {file = "Brotli-1.1.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:f0db75f47be8b8abc8d9e31bc7aad0547ca26f24a54e6fd10231d623f183d089"}, - {file = "Brotli-1.1.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:6967ced6730aed543b8673008b5a391c3b1076d834ca438bbd70635c73775368"}, - {file = "Brotli-1.1.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:7eedaa5d036d9336c95915035fb57422054014ebdeb6f3b42eac809928e40d0c"}, - {file = "Brotli-1.1.0-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:d487f5432bf35b60ed625d7e1b448e2dc855422e87469e3f450aa5552b0eb284"}, - {file = "Brotli-1.1.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash 
= "sha256:832436e59afb93e1836081a20f324cb185836c617659b07b129141a8426973c7"}, - {file = "Brotli-1.1.0-cp313-cp313-win32.whl", hash = "sha256:43395e90523f9c23a3d5bdf004733246fba087f2948f87ab28015f12359ca6a0"}, - {file = "Brotli-1.1.0-cp313-cp313-win_amd64.whl", hash = "sha256:9011560a466d2eb3f5a6e4929cf4a09be405c64154e12df0dd72713f6500e32b"}, {file = "Brotli-1.1.0-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:a090ca607cbb6a34b0391776f0cb48062081f5f60ddcce5d11838e67a01928d1"}, {file = "Brotli-1.1.0-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2de9d02f5bda03d27ede52e8cfe7b865b066fa49258cbab568720aa5be80a47d"}, {file = "Brotli-1.1.0-cp36-cp36m-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:2333e30a5e00fe0fe55903c8832e08ee9c3b1382aacf4db26664a16528d51b4b"}, @@ -775,10 +749,6 @@ files = [ {file = "Brotli-1.1.0-cp36-cp36m-musllinux_1_1_i686.whl", hash = "sha256:fd5f17ff8f14003595ab414e45fce13d073e0762394f957182e69035c9f3d7c2"}, {file = "Brotli-1.1.0-cp36-cp36m-musllinux_1_1_ppc64le.whl", hash = "sha256:069a121ac97412d1fe506da790b3e69f52254b9df4eb665cd42460c837193354"}, {file = "Brotli-1.1.0-cp36-cp36m-musllinux_1_1_x86_64.whl", hash = "sha256:e93dfc1a1165e385cc8239fab7c036fb2cd8093728cbd85097b284d7b99249a2"}, - {file = "Brotli-1.1.0-cp36-cp36m-musllinux_1_2_aarch64.whl", hash = "sha256:aea440a510e14e818e67bfc4027880e2fb500c2ccb20ab21c7a7c8b5b4703d75"}, - {file = "Brotli-1.1.0-cp36-cp36m-musllinux_1_2_i686.whl", hash = "sha256:6974f52a02321b36847cd19d1b8e381bf39939c21efd6ee2fc13a28b0d99348c"}, - {file = "Brotli-1.1.0-cp36-cp36m-musllinux_1_2_ppc64le.whl", hash = "sha256:a7e53012d2853a07a4a79c00643832161a910674a893d296c9f1259859a289d2"}, - {file = "Brotli-1.1.0-cp36-cp36m-musllinux_1_2_x86_64.whl", hash = "sha256:d7702622a8b40c49bffb46e1e3ba2e81268d5c04a34f460978c6b5517a34dd52"}, {file = "Brotli-1.1.0-cp36-cp36m-win32.whl", hash = "sha256:a599669fd7c47233438a56936988a2478685e74854088ef5293802123b5b2460"}, {file = "Brotli-1.1.0-cp36-cp36m-win_amd64.whl", hash = "sha256:d143fd47fad1db3d7c27a1b1d66162e855b5d50a89666af46e1679c496e8e579"}, {file = "Brotli-1.1.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:11d00ed0a83fa22d29bc6b64ef636c4552ebafcef57154b4ddd132f5638fbd1c"}, @@ -790,10 +760,6 @@ files = [ {file = "Brotli-1.1.0-cp37-cp37m-musllinux_1_1_i686.whl", hash = "sha256:919e32f147ae93a09fe064d77d5ebf4e35502a8df75c29fb05788528e330fe74"}, {file = "Brotli-1.1.0-cp37-cp37m-musllinux_1_1_ppc64le.whl", hash = "sha256:23032ae55523cc7bccb4f6a0bf368cd25ad9bcdcc1990b64a647e7bbcce9cb5b"}, {file = "Brotli-1.1.0-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:224e57f6eac61cc449f498cc5f0e1725ba2071a3d4f48d5d9dffba42db196438"}, - {file = "Brotli-1.1.0-cp37-cp37m-musllinux_1_2_aarch64.whl", hash = "sha256:cb1dac1770878ade83f2ccdf7d25e494f05c9165f5246b46a621cc849341dc01"}, - {file = "Brotli-1.1.0-cp37-cp37m-musllinux_1_2_i686.whl", hash = "sha256:3ee8a80d67a4334482d9712b8e83ca6b1d9bc7e351931252ebef5d8f7335a547"}, - {file = "Brotli-1.1.0-cp37-cp37m-musllinux_1_2_ppc64le.whl", hash = "sha256:5e55da2c8724191e5b557f8e18943b1b4839b8efc3ef60d65985bcf6f587dd38"}, - {file = "Brotli-1.1.0-cp37-cp37m-musllinux_1_2_x86_64.whl", hash = "sha256:d342778ef319e1026af243ed0a07c97acf3bad33b9f29e7ae6a1f68fd083e90c"}, {file = "Brotli-1.1.0-cp37-cp37m-win32.whl", hash = "sha256:587ca6d3cef6e4e868102672d3bd9dc9698c309ba56d41c2b9c85bbb903cdb95"}, {file = "Brotli-1.1.0-cp37-cp37m-win_amd64.whl", hash = 
"sha256:2954c1c23f81c2eaf0b0717d9380bd348578a94161a65b3a2afc62c86467dd68"}, {file = "Brotli-1.1.0-cp38-cp38-macosx_10_9_universal2.whl", hash = "sha256:efa8b278894b14d6da122a72fefcebc28445f2d3f880ac59d46c90f4c13be9a3"}, @@ -806,10 +772,6 @@ files = [ {file = "Brotli-1.1.0-cp38-cp38-musllinux_1_1_i686.whl", hash = "sha256:1ab4fbee0b2d9098c74f3057b2bc055a8bd92ccf02f65944a241b4349229185a"}, {file = "Brotli-1.1.0-cp38-cp38-musllinux_1_1_ppc64le.whl", hash = "sha256:141bd4d93984070e097521ed07e2575b46f817d08f9fa42b16b9b5f27b5ac088"}, {file = "Brotli-1.1.0-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:fce1473f3ccc4187f75b4690cfc922628aed4d3dd013d047f95a9b3919a86596"}, - {file = "Brotli-1.1.0-cp38-cp38-musllinux_1_2_aarch64.whl", hash = "sha256:d2b35ca2c7f81d173d2fadc2f4f31e88cc5f7a39ae5b6db5513cf3383b0e0ec7"}, - {file = "Brotli-1.1.0-cp38-cp38-musllinux_1_2_i686.whl", hash = "sha256:af6fa6817889314555aede9a919612b23739395ce767fe7fcbea9a80bf140fe5"}, - {file = "Brotli-1.1.0-cp38-cp38-musllinux_1_2_ppc64le.whl", hash = "sha256:2feb1d960f760a575dbc5ab3b1c00504b24caaf6986e2dc2b01c09c87866a943"}, - {file = "Brotli-1.1.0-cp38-cp38-musllinux_1_2_x86_64.whl", hash = "sha256:4410f84b33374409552ac9b6903507cdb31cd30d2501fc5ca13d18f73548444a"}, {file = "Brotli-1.1.0-cp38-cp38-win32.whl", hash = "sha256:db85ecf4e609a48f4b29055f1e144231b90edc90af7481aa731ba2d059226b1b"}, {file = "Brotli-1.1.0-cp38-cp38-win_amd64.whl", hash = "sha256:3d7954194c36e304e1523f55d7042c59dc53ec20dd4e9ea9d151f1b62b4415c0"}, {file = "Brotli-1.1.0-cp39-cp39-macosx_10_9_universal2.whl", hash = "sha256:5fb2ce4b8045c78ebbc7b8f3c15062e435d47e7393cc57c25115cfd49883747a"}, @@ -822,10 +784,6 @@ files = [ {file = "Brotli-1.1.0-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:949f3b7c29912693cee0afcf09acd6ebc04c57af949d9bf77d6101ebb61e388c"}, {file = "Brotli-1.1.0-cp39-cp39-musllinux_1_1_ppc64le.whl", hash = "sha256:89f4988c7203739d48c6f806f1e87a1d96e0806d44f0fba61dba81392c9e474d"}, {file = "Brotli-1.1.0-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:de6551e370ef19f8de1807d0a9aa2cdfdce2e85ce88b122fe9f6b2b076837e59"}, - {file = "Brotli-1.1.0-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:0737ddb3068957cf1b054899b0883830bb1fec522ec76b1098f9b6e0f02d9419"}, - {file = "Brotli-1.1.0-cp39-cp39-musllinux_1_2_i686.whl", hash = "sha256:4f3607b129417e111e30637af1b56f24f7a49e64763253bbc275c75fa887d4b2"}, - {file = "Brotli-1.1.0-cp39-cp39-musllinux_1_2_ppc64le.whl", hash = "sha256:6c6e0c425f22c1c719c42670d561ad682f7bfeeef918edea971a79ac5252437f"}, - {file = "Brotli-1.1.0-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:494994f807ba0b92092a163a0a283961369a65f6cbe01e8891132b7a320e61eb"}, {file = "Brotli-1.1.0-cp39-cp39-win32.whl", hash = "sha256:f0d8a7a6b5983c2496e364b969f0e526647a06b075d034f3297dc66f3b360c64"}, {file = "Brotli-1.1.0-cp39-cp39-win_amd64.whl", hash = "sha256:cdad5b9014d83ca68c25d2e9444e28e967ef16e80f6b436918c700c117a85467"}, {file = "Brotli-1.1.0.tar.gz", hash = "sha256:81de08ac11bcb85841e440c13611c00b67d3bf82698314928d0b676362546724"}, @@ -2261,8 +2219,6 @@ files = [ {file = "greenlet-3.2.4-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c2ca18a03a8cfb5b25bc1cbe20f3d9a4c80d8c3b13ba3df49ac3961af0b1018d"}, {file = "greenlet-3.2.4-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:9fe0a28a7b952a21e2c062cd5756d34354117796c6d9215a87f55e38d15402c5"}, {file = "greenlet-3.2.4-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:8854167e06950ca75b898b104b63cc646573aa5fef1353d4508ecdd1ee76254f"}, - 
{file = "greenlet-3.2.4-cp310-cp310-musllinux_1_2_aarch64.whl", hash = "sha256:f47617f698838ba98f4ff4189aef02e7343952df3a615f847bb575c3feb177a7"}, - {file = "greenlet-3.2.4-cp310-cp310-musllinux_1_2_x86_64.whl", hash = "sha256:af41be48a4f60429d5cad9d22175217805098a9ef7c40bfef44f7669fb9d74d8"}, {file = "greenlet-3.2.4-cp310-cp310-win_amd64.whl", hash = "sha256:73f49b5368b5359d04e18d15828eecc1806033db5233397748f4ca813ff1056c"}, {file = "greenlet-3.2.4-cp311-cp311-macosx_11_0_universal2.whl", hash = "sha256:96378df1de302bc38e99c3a9aa311967b7dc80ced1dcc6f171e99842987882a2"}, {file = "greenlet-3.2.4-cp311-cp311-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:1ee8fae0519a337f2329cb78bd7a8e128ec0f881073d43f023c7b8d4831d5246"}, @@ -2272,8 +2228,6 @@ files = [ {file = "greenlet-3.2.4-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2523e5246274f54fdadbce8494458a2ebdcdbc7b802318466ac5606d3cded1f8"}, {file = "greenlet-3.2.4-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:1987de92fec508535687fb807a5cea1560f6196285a4cde35c100b8cd632cc52"}, {file = "greenlet-3.2.4-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:55e9c5affaa6775e2c6b67659f3a71684de4c549b3dd9afca3bc773533d284fa"}, - {file = "greenlet-3.2.4-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:c9c6de1940a7d828635fbd254d69db79e54619f165ee7ce32fda763a9cb6a58c"}, - {file = "greenlet-3.2.4-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:03c5136e7be905045160b1b9fdca93dd6727b180feeafda6818e6496434ed8c5"}, {file = "greenlet-3.2.4-cp311-cp311-win_amd64.whl", hash = "sha256:9c40adce87eaa9ddb593ccb0fa6a07caf34015a29bf8d344811665b573138db9"}, {file = "greenlet-3.2.4-cp312-cp312-macosx_11_0_universal2.whl", hash = "sha256:3b67ca49f54cede0186854a008109d6ee71f66bd57bb36abd6d0a0267b540cdd"}, {file = "greenlet-3.2.4-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:ddf9164e7a5b08e9d22511526865780a576f19ddd00d62f8a665949327fde8bb"}, @@ -2283,8 +2237,6 @@ files = [ {file = "greenlet-3.2.4-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:3b3812d8d0c9579967815af437d96623f45c0f2ae5f04e366de62a12d83a8fb0"}, {file = "greenlet-3.2.4-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:abbf57b5a870d30c4675928c37278493044d7c14378350b3aa5d484fa65575f0"}, {file = "greenlet-3.2.4-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:20fb936b4652b6e307b8f347665e2c615540d4b42b3b4c8a321d8286da7e520f"}, - {file = "greenlet-3.2.4-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:ee7a6ec486883397d70eec05059353b8e83eca9168b9f3f9a361971e77e0bcd0"}, - {file = "greenlet-3.2.4-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:326d234cbf337c9c3def0676412eb7040a35a768efc92504b947b3e9cfc7543d"}, {file = "greenlet-3.2.4-cp312-cp312-win_amd64.whl", hash = "sha256:a7d4e128405eea3814a12cc2605e0e6aedb4035bf32697f72deca74de4105e02"}, {file = "greenlet-3.2.4-cp313-cp313-macosx_11_0_universal2.whl", hash = "sha256:1a921e542453fe531144e91e1feedf12e07351b1cf6c9e8a3325ea600a715a31"}, {file = "greenlet-3.2.4-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:cd3c8e693bff0fff6ba55f140bf390fa92c994083f838fece0f63be121334945"}, @@ -2294,8 +2246,6 @@ files = [ {file = "greenlet-3.2.4-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:23768528f2911bcd7e475210822ffb5254ed10d71f4028387e5a99b4c6699671"}, {file = "greenlet-3.2.4-cp313-cp313-musllinux_1_1_aarch64.whl", hash = 
"sha256:00fadb3fedccc447f517ee0d3fd8fe49eae949e1cd0f6a611818f4f6fb7dc83b"}, {file = "greenlet-3.2.4-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:d25c5091190f2dc0eaa3f950252122edbbadbb682aa7b1ef2f8af0f8c0afefae"}, - {file = "greenlet-3.2.4-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:6e343822feb58ac4d0a1211bd9399de2b3a04963ddeec21530fc426cc121f19b"}, - {file = "greenlet-3.2.4-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:ca7f6f1f2649b89ce02f6f229d7c19f680a6238af656f61e0115b24857917929"}, {file = "greenlet-3.2.4-cp313-cp313-win_amd64.whl", hash = "sha256:554b03b6e73aaabec3745364d6239e9e012d64c68ccd0b8430c64ccc14939a8b"}, {file = "greenlet-3.2.4-cp314-cp314-macosx_11_0_universal2.whl", hash = "sha256:49a30d5fda2507ae77be16479bdb62a660fa51b1eb4928b524975b3bde77b3c0"}, {file = "greenlet-3.2.4-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:299fd615cd8fc86267b47597123e3f43ad79c9d8a22bebdce535e53550763e2f"}, @@ -2303,8 +2253,6 @@ files = [ {file = "greenlet-3.2.4-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:b4a1870c51720687af7fa3e7cda6d08d801dae660f75a76f3845b642b4da6ee1"}, {file = "greenlet-3.2.4-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:061dc4cf2c34852b052a8620d40f36324554bc192be474b9e9770e8c042fd735"}, {file = "greenlet-3.2.4-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:44358b9bf66c8576a9f57a590d5f5d6e72fa4228b763d0e43fee6d3b06d3a337"}, - {file = "greenlet-3.2.4-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:2917bdf657f5859fbf3386b12d68ede4cf1f04c90c3a6bc1f013dd68a22e2269"}, - {file = "greenlet-3.2.4-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:015d48959d4add5d6c9f6c5210ee3803a830dce46356e3bc326d6776bde54681"}, {file = "greenlet-3.2.4-cp314-cp314-win_amd64.whl", hash = "sha256:e37ab26028f12dbb0ff65f29a8d3d44a765c61e729647bf2ddfbbed621726f01"}, {file = "greenlet-3.2.4-cp39-cp39-macosx_11_0_universal2.whl", hash = "sha256:b6a7c19cf0d2742d0809a4c05975db036fdff50cd294a93632d6a310bf9ac02c"}, {file = "greenlet-3.2.4-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:27890167f55d2387576d1f41d9487ef171849ea0359ce1510ca6e06c8bece11d"}, @@ -2314,8 +2262,6 @@ files = [ {file = "greenlet-3.2.4-cp39-cp39-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:c9913f1a30e4526f432991f89ae263459b1c64d1608c0d22a5c79c287b3c70df"}, {file = "greenlet-3.2.4-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:b90654e092f928f110e0007f572007c9727b5265f7632c2fa7415b4689351594"}, {file = "greenlet-3.2.4-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:81701fd84f26330f0d5f4944d4e92e61afe6319dcd9775e39396e39d7c3e5f98"}, - {file = "greenlet-3.2.4-cp39-cp39-musllinux_1_2_aarch64.whl", hash = "sha256:28a3c6b7cd72a96f61b0e4b2a36f681025b60ae4779cc73c1535eb5f29560b10"}, - {file = "greenlet-3.2.4-cp39-cp39-musllinux_1_2_x86_64.whl", hash = "sha256:52206cd642670b0b320a1fd1cbfd95bca0e043179c1d8a045f2c6109dfe973be"}, {file = "greenlet-3.2.4-cp39-cp39-win32.whl", hash = "sha256:65458b409c1ed459ea899e939f0e1cdb14f58dbc803f2f93c5eab5694d32671b"}, {file = "greenlet-3.2.4-cp39-cp39-win_amd64.whl", hash = "sha256:d2e685ade4dafd447ede19c31277a224a239a0a1a4eca4e6390efedf20260cfb"}, {file = "greenlet-3.2.4.tar.gz", hash = "sha256:0dca0d95ff849f9a364385f36ab49f50065d76964944638be9691e1832e9f86d"}, @@ -8059,14 +8005,14 @@ test = ["Cython", "array-api-strict (>=2.3.1)", "asv", "gmpy2", "hypothesis (>=6 [[package]] name = "scorecardpy" 
-version = "0.1.9.7" +version = "0.1.9.6" description = "Credit Risk Scorecard" optional = true python-versions = "*" groups = ["main"] markers = "extra == \"all\" or extra == \"credit-risk\"" files = [ - {file = "scorecardpy-0.1.9.7.tar.gz", hash = "sha256:a81c7e6f3bf5f10a87b61af73b25f1fc8bc5acbadf5d9e38c3addb02df128d03"}, + {file = "scorecardpy-0.1.9.6.tar.gz", hash = "sha256:53b339100199276c55270a43bad9fb9fc23fcd285bcf9ba3caf7b585c7adea97"}, ] [package.dependencies] @@ -8958,12 +8904,6 @@ files = [ {file = "statsmodels-0.14.5-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:5a085d47c8ef5387279a991633883d0e700de2b0acc812d7032d165888627bef"}, {file = "statsmodels-0.14.5-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:9f866b2ebb2904b47c342d00def83c526ef2eb1df6a9a3c94ba5fe63d0005aec"}, {file = "statsmodels-0.14.5-cp313-cp313-win_amd64.whl", hash = "sha256:2a06bca03b7a492f88c8106103ab75f1a5ced25de90103a89f3a287518017939"}, - {file = "statsmodels-0.14.5-cp314-cp314-macosx_10_15_x86_64.whl", hash = "sha256:07c4dad25bbb15864a31b4917a820f6d104bdc24e5ddadcda59027390c3bed9e"}, - {file = "statsmodels-0.14.5-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:babb067c852e966c2c933b79dbb5d0240919d861941a2ef6c0e13321c255528d"}, - {file = "statsmodels-0.14.5-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:110194b137286173cc676d7bad0119a197778de6478fc6cbdc3b33571165ac1e"}, - {file = "statsmodels-0.14.5-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:9c8a9c384a60c80731b278e7fd18764364c8817f4995b13a175d636f967823d1"}, - {file = "statsmodels-0.14.5-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:557df3a870a57248df744fdfcc444ecbc5bdbf1c042b8a8b5d8e3e797830dc2a"}, - {file = "statsmodels-0.14.5-cp314-cp314-win_amd64.whl", hash = "sha256:95af7a9c4689d514f4341478b891f867766f3da297f514b8c4adf08f4fa61d03"}, {file = "statsmodels-0.14.5-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:b23b8f646dd78ef5e8d775d879208f8dc0a73418b41c16acac37361ff9ab7738"}, {file = "statsmodels-0.14.5-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:4e5e26b21d2920905764fb0860957d08b5ba2fae4466ef41b1f7c53ecf9fc7fa"}, {file = "statsmodels-0.14.5-cp39-cp39-manylinux2014_aarch64.manylinux_2_17_aarch64.manylinux_2_28_aarch64.whl", hash = "sha256:4a060c7e0841c549c8ce2825fd6687e6757e305d9c11c9a73f6c5a0ce849bb69"}, @@ -10409,4 +10349,4 @@ xgboost = ["xgboost"] [metadata] lock-version = "2.1" python-versions = ">=3.9,<3.13" -content-hash = "496d696b7104f8c79ea9d6d1721856b5e8d43f2b05f682309b2ae2a6f507ae25" +content-hash = "ed3612700d61dbd64bf4850986c5cc7fd527d82ffd448cd00a2d02a00356cc3a" diff --git a/pyproject.toml b/pyproject.toml index a606cda4f..b221c85a2 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -53,7 +53,7 @@ all = [ "bert-score (>=0.3.13)", "arch", "shap (>=0.46.0)", - "scorecardpy (>=0.1.9.6,<0.2.0)", + "scorecardpy==0.1.9.6", ] huggingface = [ "transformers (>=4.32.0,<5.0.0)", @@ -81,7 +81,7 @@ pytorch = ["torch (>=2.0.0)"] stats = ["scipy", "statsmodels", "arch"] xgboost = ["xgboost (>=1.5.2,<3)"] explainability = ["shap (>=0.46.0)"] -credit_risk = ["scorecardpy (>=0.1.9.6,<0.2.0)"] +credit_risk = ["scorecardpy==0.1.9.6"] datasets = ["datasets (>=2.10.0,<3.0.0)"] pii-detection = ["presidio-analyzer", "presidio-structured"] diff --git a/tests/unit_tests/data_validation/test_WOEBinPlots.py b/tests/unit_tests/data_validation/test_WOEBinPlots.py index 03a332aaf..a9f9ef063 
100644 --- a/tests/unit_tests/data_validation/test_WOEBinPlots.py +++ b/tests/unit_tests/data_validation/test_WOEBinPlots.py @@ -2,11 +2,16 @@ import pandas as pd import validmind as vm import plotly.graph_objs as go -from validmind.errors import SkipTestError -from validmind.tests.data_validation.WOEBinPlots import WOEBinPlots +from validmind.errors import MissingDependencyError, SkipTestError from validmind import RawData +try: + from validmind.tests.data_validation.WOEBinPlots import WOEBinPlots +except MissingDependencyError: + WOEBinPlots = None + +@unittest.skipIf(WOEBinPlots is None, "scorecardpy is not installed") class TestWOEBinPlots(unittest.TestCase): def setUp(self): # Create a sample dataset with categorical features and binary target diff --git a/tests/unit_tests/data_validation/test_WOEBinTable.py b/tests/unit_tests/data_validation/test_WOEBinTable.py index 089d81b33..c4633b161 100644 --- a/tests/unit_tests/data_validation/test_WOEBinTable.py +++ b/tests/unit_tests/data_validation/test_WOEBinTable.py @@ -1,10 +1,16 @@ import unittest import pandas as pd import validmind as vm -from validmind.errors import SkipTestError -from validmind.tests.data_validation.WOEBinTable import WOEBinTable, RawData +from validmind.errors import MissingDependencyError, SkipTestError +from validmind import RawData +try: + from validmind.tests.data_validation.WOEBinTable import WOEBinTable +except MissingDependencyError: + WOEBinTable = None + +@unittest.skipIf(WOEBinTable is None, "scorecardpy is not installed") class TestWOEBinTable(unittest.TestCase): def setUp(self): # Create a sample dataset with categorical and numeric features and binary target diff --git a/validmind/datasets/credit_risk/lending_club.py b/validmind/datasets/credit_risk/lending_club.py index c7fab7713..06491631b 100644 --- a/validmind/datasets/credit_risk/lending_club.py +++ b/validmind/datasets/credit_risk/lending_club.py @@ -20,14 +20,12 @@ try: import scorecardpy as sc except ImportError as e: - if "scorecardpy" in str(e): - raise MissingDependencyError( - "Missing required package `scorecardpy` for credit risk demos. " - "Please run `pip install validmind[credit_risk]` or `pip install scorecardpy`.", - required_dependencies=["scorecardpy"], - extra="credit_risk", - ) from e - raise e + raise MissingDependencyError( + "Missing required package `scorecardpy` for credit risk demos. " + "Please run `pip install validmind[credit_risk]` or `pip install scorecardpy`.", + required_dependencies=["scorecardpy"], + extra="credit_risk", + ) from e current_path = os.path.dirname(os.path.abspath(__file__)) dataset_path = os.path.join(current_path, "datasets") diff --git a/validmind/scorers/llm/deepeval/ArgumentCorrectness.py b/validmind/scorers/llm/deepeval/ArgumentCorrectness.py index 5c271deb4..424c15ef8 100644 --- a/validmind/scorers/llm/deepeval/ArgumentCorrectness.py +++ b/validmind/scorers/llm/deepeval/ArgumentCorrectness.py @@ -40,10 +40,7 @@ def ArgumentCorrectness( dataset: VMDataset, threshold: float = 0.7, input_column: str = "input", - tools_called_column: str = "tools_called", - agent_output_column: str = "agent_output", - actual_output_column: str = "actual_output", - strict_mode: bool = False, + actual_tools_called_column: str = "tools_called", ) -> List[Dict[str, Any]]: """Evaluates agent argument correctness using deepeval's ArgumentCorrectnessMetric. @@ -55,8 +52,15 @@ def ArgumentCorrectness( evaluates argument correctness based on the input context rather than comparing against expected values. 
+ This scorer evaluates pre-computed columns only: the agent is not re-run, and the + metric judges argument correctness from the task input and the recorded tool calls. + Args: dataset: Dataset containing the agent input and tool calls + actual_tools_called_column: Column name for the tools the agent actually called (default: "tools_called") threshold: Minimum passing threshold (default: 0.7) input_column: Column name for the task input (default: "input") - tools_called_column: Column name for tools called (default: "tools_called") @@ -69,51 +73,34 @@ Raises: ValueError: If required columns are missing """ - # Validate required columns exist in dataset + from validmind.scorers.llm.deepeval import _convert_to_tool_call_list + missing_columns: List[str] = [] if input_column not in dataset._df.columns: missing_columns.append(input_column) - + if actual_tools_called_column not in dataset._df.columns: + missing_columns.append(actual_tools_called_column) if missing_columns: raise ValueError( - f"Required columns {missing_columns} not found in dataset. " + f"ArgumentCorrectness requires columns {missing_columns}. " f"Available columns: {dataset._df.columns.tolist()}" ) - _, model = get_client_and_model() - - metric = ArgumentCorrectnessMetric( - threshold=threshold, - model=model, - include_reason=True, - strict_mode=strict_mode, - verbose_mode=False, - ) - - # Import helper functions to avoid circular import - from validmind.scorers.llm.deepeval import ( - _convert_to_tool_call_list, - extract_tool_calls_from_agent_output, - ) - + _, llm_model = get_client_and_model() results: List[Dict[str, Any]] = [] - for _, row in dataset._df.iterrows(): - input_value = row[input_column] - # Extract tools called - if tools_called_column in dataset._df.columns: - tools_called_value = row.get(tools_called_column, []) - else: - agent_output = row.get(agent_output_column, {}) - tools_called_value = extract_tool_calls_from_agent_output(agent_output) - tools_called_list = _convert_to_tool_call_list(tools_called_value) + for _, row in dataset._df.iterrows(): + actual_tools_value = row.get(actual_tools_called_column, []) + actual_tools_list = _convert_to_tool_call_list(actual_tools_value) - actual_output_value = row.get(actual_output_column, "") + metric = ArgumentCorrectnessMetric( + threshold=threshold, + model=llm_model, + ) test_case = LLMTestCase( - input=input_value, - tools_called=tools_called_list, - actual_output=actual_output_value, + input=row[input_column], + tools_called=actual_tools_list, ) result = evaluate(test_cases=[test_case], metrics=[metric]) @@ -121,5 +108,4 @@ score = metric_data.score reason = getattr(metric_data, "reason", "No reason provided") results.append({"score": score, "reason": reason}) - return results diff --git a/validmind/scorers/llm/deepeval/PlanAdherence.py b/validmind/scorers/llm/deepeval/PlanAdherence.py index 04fb8c30f..32896e9c1 100644 --- a/validmind/scorers/llm/deepeval/PlanAdherence.py +++ b/validmind/scorers/llm/deepeval/PlanAdherence.py @@ -2,18 +2,18 @@ # Refer to the LICENSE file in the root of this repository for details. 
# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial -from typing import Any, Dict, List +import uuid +from typing import Any, Dict, List, Optional from validmind import tags, tasks from validmind.ai.utils import get_client_and_model from validmind.errors import MissingDependencyError from validmind.tests.decorator import scorer +from validmind.vm_models import VMModel from validmind.vm_models.dataset import VMDataset try: - from deepeval import evaluate from deepeval.metrics import PlanAdherenceMetric - from deepeval.test_case import LLMTestCase except ImportError as e: if "deepeval" in str(e): raise MissingDependencyError( @@ -33,84 +33,77 @@ @tasks("llm") def PlanAdherence( dataset: VMDataset, + model: Optional[VMModel] = None, threshold: float = 0.7, input_column: str = "input", - tools_called_column: str = "tools_called", - actual_output_column: str = "actual_output", - expected_output_column: str = "expected_output", strict_mode: bool = False, ) -> List[Dict[str, Any]]: - """Evaluates agent plan adherence using deepeval's PlanAdherenceMetric. + """ + Evaluates whether an agent follows its generated plan during execution, using deepeval's PlanAdherenceMetric. + + Plan adherence is critical in agentic reasoning: even the best plans become irrelevant if not executed faithfully. + This scorer measures whether each agent output (per row) aligns with the agent's plan. - This metric evaluates whether your agent follows its own plan during execution. - Creating a good plan is only half the battle—an agent that deviates from its - strategy mid-execution undermines its own reasoning. + A ``model`` with a callable ``predict_fn`` is required: each dataset row triggers a full agent run inside deepeval's evals_iterator, so plan adherence is assessed from real execution traces. Args: - dataset: Dataset containing the agent input, plan, and execution steps - threshold: Minimum passing threshold (default: 0.7) - input_column: Column name for the task input (default: "input") - plan_column: Column name for the agent's plan (default: "plan") - execution_steps_column: Column name for execution steps (default: "execution_steps") - agent_output_column: Column name for agent output containing plan and steps (default: "agent_output") - tools_called_column: Column name for tools called (default: "tools_called") - strict_mode: If True, enforces a binary score (0 or 1) + dataset (VMDataset): Dataset containing the agent input column. + model (Optional[VMModel]): ValidMind agent model with a ``predict_fn``. Required for trace-based evaluation. + threshold (float): Passing threshold for adherence score. Defaults to 0.7. + input_column (str): Column name for the main agent prompt or task input. Defaults to "input". + strict_mode (bool): If True, only returns scores 0 or 1 (binary adherence). Returns: - List[Dict[str, Any]] with keys "score" and "reason" for each row. + List[Dict[str, Any]]: For each row, returns a dict with the final plan adherence "score" (float) and "reason" (str) from the metric. Raises: - ValueError: If required columns are missing - """ - # Validate required columns exist in dataset - missing_columns: List[str] = [] - if input_column not in dataset._df.columns: - missing_columns.append(input_column) - - if tools_called_column not in dataset._df.columns: - missing_columns.append(tools_called_column) + ValueError: If ``model`` is missing or does not provide a callable ``predict_fn``. 
+ MissingDependencyError: If required deepeval dependencies are not installed. - if actual_output_column not in dataset._df.columns: - missing_columns.append(actual_output_column) - - if expected_output_column not in dataset._df.columns: - missing_columns.append(expected_output_column) - - if missing_columns: + Example: + >>> results = PlanAdherence(dataset=my_ds, model=my_agent, threshold=0.8, input_column="user_input") + >>> print(results[0]["score"], results[0]["reason"]) + """ + # Trace-based path: run agent inside evals_iterator so PlanAdherenceMetric sees the trace + if model is None or not hasattr(model, "predict_fn") or model.predict_fn is None: raise ValueError( - f"Required columns {missing_columns} not found in dataset. " - f"Available columns: {dataset._df.columns.tolist()}" + "PlanAdherence requires a `model` with a callable `predict_fn` for trace-based evaluation." ) + try: + from deepeval.dataset import EvaluationDataset, Golden + except ImportError: + raise MissingDependencyError( + "PlanAdherence with model requires deepeval.dataset.EvaluationDataset and Golden. " + "Please ensure deepeval is up to date: pip install -U deepeval", + required_dependencies=["deepeval"], + extra="llm", + ) from None - _, model = get_client_and_model() - - metric = PlanAdherenceMetric( - threshold=threshold, - model=model, - include_reason=True, - strict_mode=strict_mode, - verbose_mode=False, - ) - + _, llm_model = get_client_and_model() results: List[Dict[str, Any]] = [] - for _, row in dataset._df.iterrows(): - input_value = row[input_column] - actual_output_value = row.get(actual_output_column, "") - expected_output_value = row.get(expected_output_column, "") - - tools_called_value = row.get(tools_called_column, []) - test_case = LLMTestCase( - input=input_value, - actual_output=actual_output_value, - expected_output=expected_output_value, - tools_called=tools_called_value, + # Run one golden at a time so the metric runs after each predict_fn and + # metric.score is set when the iterator exits. + for _, row in dataset._df.iterrows(): + golden = Golden(input=row[input_column]) + eval_dataset = EvaluationDataset(goldens=[golden]) + metric = PlanAdherenceMetric( + threshold=threshold, + model=llm_model, + include_reason=True, + strict_mode=strict_mode, + verbose_mode=False, ) - - result = evaluate(test_cases=[test_case], metrics=[metric]) - metric_data = result.test_results[0].metrics_data[0] - score = metric_data.score - reason = getattr(metric_data, "reason", "No reason provided") + for golden in eval_dataset.evals_iterator(metrics=[metric]): + model.predict_fn( + { + "input": golden.input, + "session_id": str(uuid.uuid4()), + } + ) + # After the loop, iterator ran the metric for this golden; score is set. + score = metric.score + reason = getattr(metric, "reason", "No reason provided") results.append({"score": score, "reason": reason}) return results diff --git a/validmind/scorers/llm/deepeval/PlanQuality.py b/validmind/scorers/llm/deepeval/PlanQuality.py index 88ca21da3..08a0d8eb3 100644 --- a/validmind/scorers/llm/deepeval/PlanQuality.py +++ b/validmind/scorers/llm/deepeval/PlanQuality.py @@ -2,18 +2,18 @@ # Refer to the LICENSE file in the root of this repository for details. 
# SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial -from typing import Any, Dict, List +import uuid +from typing import Any, Dict, List, Optional from validmind import tags, tasks from validmind.ai.utils import get_client_and_model from validmind.errors import MissingDependencyError from validmind.tests.decorator import scorer +from validmind.vm_models import VMModel from validmind.vm_models.dataset import VMDataset try: - from deepeval import evaluate from deepeval.metrics import PlanQualityMetric - from deepeval.test_case import LLMTestCase, ToolCall except ImportError as e: if "deepeval" in str(e): raise MissingDependencyError( @@ -31,82 +31,89 @@ @tasks("llm") def PlanQuality( dataset: VMDataset, + model: Optional[VMModel] = None, threshold: float = 0.7, input_column: str = "input", - actual_output_column: str = "actual_output", - agent_output_column: str = "agent_output", - tools_called_column: str = "tools_called", strict_mode: bool = False, ) -> List[Dict[str, Any]]: - """Evaluates agent plan quality using deepeval's PlanQualityMetric. + """ + Evaluates the quality of an agent's generated plan for a given input using DeepEval's PlanQualityMetric. - This metric evaluates whether the plan your agent generates is logical, complete, - and efficient for accomplishing the given task. It extracts the task and plan from - your agent's trace and uses an LLM judge to assess plan quality. + This scorer measures whether each plan is logical, complete, and efficient for the agent's assigned task. + It is designed for agentic LLM trace evaluation: the provided ValidMind model (agent) is executed + for each dataset row within DeepEval's evals_iterator, enabling the metric to access the captured trace data. + A model with a callable 'predict_fn' is required; the scorer raises a ValueError otherwise. Args: - dataset: Dataset containing the agent input and plan - threshold: Minimum passing threshold (default: 0.7) - input_column: Column name for the task input (default: "input") - agent_output_column: Column name for agent output containing plan in trace (default: "agent_output") - tools_called_column: Column name for tools called by the agent (default: "tools_called") - strict_mode: If True, enforces a binary score (0 or 1) + dataset (VMDataset): Dataset containing the agent input column. + model (Optional[VMModel]): ValidMind agent model with a 'predict_fn' method. Required for trace-based evaluation. + threshold (float): Minimum score required to pass (default: 0.7). + input_column (str): Name of the dataset column containing the primary agent input (default: "input"). + strict_mode (bool): If True, restricts scores to 0 or 1 (binary evaluation). Returns: - List[Dict[str, Any]] with keys "score" and "reason" for each row. + List[Dict[str, Any]]: For each row, a dictionary with: + - "score" (float): Quality score assigned by the metric. + - "reason" (str): Explanation from the metric regarding the plan quality. Raises: - ValueError: If required columns are missing - """ - # Validate required columns exist in dataset - missing_columns: List[str] = [] - if input_column not in dataset._df.columns: - missing_columns.append(input_column) + ValueError: If 'model' is missing or does not provide a callable 'predict_fn'. + MissingDependencyError: If the 'deepeval' package or its required objects are not installed. 
- if tools_called_column not in dataset._df.columns: - missing_columns.append(tools_called_column) + Example: >>> results = PlanQuality(dataset=my_dataset, model=my_agent) >>> print(results[0]["score"], results[0]["reason"]) - if actual_output_column not in dataset._df.columns: - missing_columns.append(actual_output_column) + Test Purpose: + Validates agentic reasoning quality by ensuring generated plans are robust, actionable, and well-aligned + with intended tasks. This strengthens trust in agent-based LLM workflows by systematically assessing + plan structure and effectiveness. - if missing_columns: + Interpretation: + - High "score" values indicate strong plan quality as judged by the LLM. + - The "reason" field contains diagnostic information about what contributed to the score. + """ + # Trace-based path: run agent inside evals_iterator so PlanQualityMetric sees the trace + if model is None or not hasattr(model, "predict_fn") or model.predict_fn is None: raise ValueError( - f"Required columns {missing_columns} not found in dataset. " - f"Available columns: {dataset._df.columns.tolist()}" + "PlanQuality requires a `model` with a callable `predict_fn` for trace-based evaluation." ) + try: + from deepeval.dataset import EvaluationDataset, Golden + except ImportError: + raise MissingDependencyError( + "PlanQuality with model requires deepeval.dataset.EvaluationDataset and Golden. " + "Please ensure deepeval is up to date: pip install -U deepeval", + required_dependencies=["deepeval"], + extra="llm", + ) from None - _, model = get_client_and_model() - - metric = PlanQualityMetric( - threshold=threshold, - model=model, - include_reason=True, - strict_mode=strict_mode, - verbose_mode=False, - ) - results: List[Dict[str, Any]] = [] + _, llm_model = get_client_and_model() + + # Run one golden at a time so the metric runs after each predict_fn and + # metric.score is set when the iterator exits (evals_iterator runs the + # metric when advancing; with one golden we get one iteration then exit). for _, row in dataset._df.iterrows(): - input_value = row[input_column] - actual_output_value = row.get(actual_output_column, "") - tools_called_value = row.get(tools_called_column, []) - if not isinstance(tools_called_value, list) or not all( - isinstance(tool, ToolCall) for tool in tools_called_value - ): - from validmind.scorers.llm.deepeval import _convert_to_tool_call_list - - tools_called_value = _convert_to_tool_call_list(tools_called_value) - test_case = LLMTestCase( - input=input_value, - actual_output=actual_output_value, - tools_called=tools_called_value, - _trace_dict=row.get(agent_output_column, {}), + golden = Golden(input=row[input_column]) + eval_dataset = EvaluationDataset(goldens=[golden]) + metric = PlanQualityMetric( + threshold=threshold, + model=llm_model, + include_reason=True, + strict_mode=strict_mode, ) - - result = evaluate(test_cases=[test_case], metrics=[metric]) - metric_data = result.test_results[0].metrics_data[0] - score = metric_data.score - reason = getattr(metric_data, "reason", "No reason provided") + for golden in eval_dataset.evals_iterator(metrics=[metric]): + model.predict_fn( + { + "input": golden.input, + "session_id": str(uuid.uuid4()), + } + ) + # After the loop, iterator ran the metric for this golden; score is set. 
+ score = metric.score + reason = getattr(metric, "reason", "No reason provided") results.append({"score": score, "reason": reason}) return results diff --git a/validmind/scorers/llm/deepeval/TaskCompletion.py b/validmind/scorers/llm/deepeval/TaskCompletion.py index 060cce6a8..1fb70d7ee 100644 --- a/validmind/scorers/llm/deepeval/TaskCompletion.py +++ b/validmind/scorers/llm/deepeval/TaskCompletion.py @@ -2,18 +2,18 @@ # Refer to the LICENSE file in the root of this repository for details. # SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial -from typing import Any, Dict, List +import uuid +from typing import Any, Dict, List, Optional from validmind import tags, tasks from validmind.ai.utils import get_client_and_model from validmind.errors import MissingDependencyError from validmind.tests.decorator import scorer +from validmind.vm_models import VMModel from validmind.vm_models.dataset import VMDataset try: - from deepeval import evaluate from deepeval.metrics import TaskCompletionMetric - from deepeval.test_case import LLMTestCase, ToolCall except ImportError as e: if "deepeval" in str(e): raise MissingDependencyError( @@ -26,130 +26,30 @@ raise e -def _extract_tool_responses(messages: List[Any]) -> Dict[str, str]: - """Extract tool responses from messages.""" - tool_responses = {} - - for message in messages: - # Handle both object and dictionary formats - if isinstance(message, dict): - # Dictionary format - if ( - message.get("name") - and message.get("content") - and message.get("tool_call_id") - ): - tool_responses[message["tool_call_id"]] = message["content"] - else: - # Object format - if hasattr(message, "name") and hasattr(message, "content"): - if hasattr(message, "tool_call_id"): - tool_responses[message.tool_call_id] = message.content - - return tool_responses - - -def _extract_tool_calls_from_message( - message: Any, tool_responses: Dict[str, str] -) -> List[ToolCall]: - """Extract tool calls from a single message.""" - tool_calls = [] - - # Handle both object and dictionary formats - if isinstance(message, dict): - # Dictionary format - if message.get("tool_calls"): - for tool_call in message["tool_calls"]: - tool_name = tool_call.get("name") - tool_args = tool_call.get("args", {}) - tool_id = tool_call.get("id") - - if tool_name and tool_id: - # Get the response for this tool call - response = tool_responses.get(tool_id, "") - - # Create ToolCall object - tool_call_obj = ToolCall( - name=tool_name, input_parameters=tool_args, output=response - ) - tool_calls.append(tool_call_obj) - else: - # Object format - if hasattr(message, "tool_calls") and message.tool_calls: - for tool_call in message.tool_calls: - # Handle both dictionary and object formats - if isinstance(tool_call, dict): - tool_name = tool_call.get("name") - tool_args = tool_call.get("args", {}) - tool_id = tool_call.get("id") - else: - # ToolCall object - tool_name = getattr(tool_call, "name", None) - tool_args = getattr(tool_call, "args", {}) - tool_id = getattr(tool_call, "id", None) - - if tool_name and tool_id: - # Get the response for this tool call - response = tool_responses.get(tool_id, "") - - # Create ToolCall object - tool_call_obj = ToolCall( - name=tool_name, input_parameters=tool_args, output=response - ) - tool_calls.append(tool_call_obj) - - return tool_calls - - -def extract_tool_calls_from_agent_output( - agent_output: Dict[str, Any] -) -> List[ToolCall]: - """ - Extract tool calls from the banking_agent_model_output column. 
- - Args: - agent_output: The dictionary from banking_agent_model_output column - - Returns: - List of ToolCall objects with name, args, and response - """ - tool_calls = [] - - if not isinstance(agent_output, dict) or "messages" not in agent_output: - return tool_calls - - messages = agent_output["messages"] - - # First pass: collect tool responses - tool_responses = _extract_tool_responses(messages) - - # Second pass: extract tool calls and match with responses - for message in messages: - message_tool_calls = _extract_tool_calls_from_message(message, tool_responses) - tool_calls.extend(message_tool_calls) - - return tool_calls - - # Create custom ValidMind tests for DeepEval metrics @scorer() @tags("llm", "TaskCompletion", "deepeval", "agentic") @tasks("llm") def TaskCompletion( dataset: VMDataset, + model: Optional[VMModel] = None, threshold: float = 0.5, input_column: str = "input", - actual_output_column: str = "actual_output", - agent_output_column: str = "agent_output", - tools_called_column: str = "tools_called", strict_mode: bool = False, ) -> List[Dict[str, Any]]: """Evaluates agent task completion using deepeval's TaskCompletionMetric. This metric assesses whether the agent's output completes the requested task. + A ``model`` with a callable ``predict_fn`` is required: the agent is run per row + inside deepeval's evals_iterator so the metric receives trace data. + Args: dataset: Dataset containing the agent input and final output + model: ValidMind model (agent) with a callable predict_fn; required. The agent + is run per row inside deepeval's evals_iterator so the metric receives trace data. threshold: Minimum passing threshold (default: 0.5) input_column: Column name for the task input (default: "input") - actual_output_column: Column for the agent's final output (default: "actual_output") @@ -161,49 +61,43 @@ def TaskCompletion( Raises: - ValueError: If required columns are missing + ValueError: If ``model`` is missing or does not provide a callable ``predict_fn`` """ - - # Validate required columns exist in dataset - missing_columns: List[str] = [] - for col in [input_column, actual_output_column]: - if col not in dataset._df.columns: - missing_columns.append(col) - if missing_columns: + # Trace-based path: run agent inside evals_iterator so metric sees the trace + if model is None or not hasattr(model, "predict_fn") or model.predict_fn is None: raise ValueError( - f"Required columns {missing_columns} not found in dataset. " - f"Available columns: {dataset.df.columns.tolist()}" + "TaskCompletion requires a `model` with a callable `predict_fn` for trace-based evaluation." ) + try: + from deepeval.dataset import EvaluationDataset, Golden + except ImportError: + raise MissingDependencyError( + "TaskCompletion with model requires deepeval.dataset.EvaluationDataset and Golden. 
" + "Please ensure deepeval is up to date: pip install -U deepeval", + required_dependencies=["deepeval"], + extra="llm", + ) from None - _, model = get_client_and_model() - - metric = TaskCompletionMetric( - threshold=threshold, - model=model, - include_reason=True, - strict_mode=strict_mode, - verbose_mode=False, - ) - + _, llm_model = get_client_and_model() results: List[Dict[str, Any]] = [] - for _, row in dataset._df.iterrows(): - input_value = row[input_column] - actual_output_value = row[actual_output_column] - if tools_called_column in dataset._df.columns: - all_tool_calls = row[tools_called_column] - else: - agent_output = row.get(agent_output_column, {}) - all_tool_calls = extract_tool_calls_from_agent_output(agent_output) - test_case = LLMTestCase( - input=input_value, - actual_output=actual_output_value, - tools_called=all_tool_calls, - _trace_dict=row.get(agent_output_column, {}), + for _, row in dataset._df.iterrows(): + golden = Golden(input=row[input_column]) + eval_dataset = EvaluationDataset(goldens=[golden]) + metric = TaskCompletionMetric( + threshold=threshold, + model=llm_model, + include_reason=True, + strict_mode=strict_mode, + verbose_mode=False, ) - - result = evaluate(test_cases=[test_case], metrics=[metric]) - metric_data = result.test_results[0].metrics_data[0] - score = metric_data.score - reason = getattr(metric_data, "reason", "No reason provided") + for golden in eval_dataset.evals_iterator(metrics=[metric]): + model.predict_fn( + { + "input": golden.input, + "session_id": str(uuid.uuid4()), + } + ) + score = metric.score + reason = getattr(metric, "reason", "No reason provided") results.append({"score": score, "reason": reason}) return results diff --git a/validmind/scorers/llm/deepeval/ToolCorrectness.py b/validmind/scorers/llm/deepeval/ToolCorrectness.py index 69abe3bf9..c77c9e212 100644 --- a/validmind/scorers/llm/deepeval/ToolCorrectness.py +++ b/validmind/scorers/llm/deepeval/ToolCorrectness.py @@ -35,93 +35,77 @@ def ToolCorrectness( dataset: VMDataset, threshold: float = 0.7, input_column: str = "input", - expected_tools_column: str = "expected_tools", - tools_called_column: str = "tools_called", - agent_output_column: str = "agent_output", - actual_output_column: str = "actual_output", + expected_tools_called_column: str = "expected_tools", + actual_tools_called_column: str = "tools_called", ) -> List[Dict[str, Any]]: - """Evaluate tool-use correctness for LLM agents using deepeval's ToolCorrectnessMetric. + """ + Evaluate the correctness of tool usage in LLM agent tasks using DeepEval's ToolCorrectnessMetric. + + This scorer checks if the agent invoked the expected tools for each task instance. It compares + the list of tools the agent actually called to the reference (expected) tool calls on a per-row basis. + The metric is suitable for scenarios where LLM agents use tools or external functions in their responses. - This metric assesses whether the agent called the expected tools in a task, and whether - argument and response information matches the ground truth expectations. - The metric compares the tools the agent actually called to the list of expected tools - on a per-row basis. + If a `model` is supplied, it is used to predict/trace agent tool usage on the fly; otherwise, tool call columns + must be prepopulated in the dataset. Args: - dataset: VMDataset containing the agent input, expected tool calls, and actual tool calls. - threshold: Minimum passing threshold (default: 0.7). 
- input_column: Column containing the task input for evaluation. - expected_tools_column: Column specifying the expected tools (ToolCall/str/dict or list). - tools_called_column: Column holding the tools actually called by the agent. - If missing, will be populated by parsing agent_output_column. - agent_output_column: Column containing agent output with tool-calling trace (default: "agent_output"). - actual_output_column: Column specifying the ground-truth output string (optional). + dataset (VMDataset): Dataset containing task inputs, expected tool calls, and actual tool calls. + threshold (float, optional): Passing threshold for tool match (default: 0.7). + input_column (str, optional): Name of the column containing the LLM input prompt (default: "input"). + expected_tools_called_column (str, optional): Column with reference/expected tool calls. + actual_tools_called_column (str, optional): Column with agent's actual tool calls. Returns: - List of dicts (one per row) containing: - - "score": Tool correctness score between 0 and 1. - - "reason": ToolCorrectnessMetric's reason or explanation. + List[Dict[str, Any]]: List of dicts, one per dataset row, with: + - "score" (float): Tool correctness score in [0,1]. 1.0 means all expected tools were called. + - "reason" (str): Explanation or diagnostic from the DeepEval metric. Raises: - ValueError: If required columns are missing from dataset. + ValueError: If any required columns are missing from the provided dataset. Example: - results = ToolCorrectness(dataset=my_data) - results[0]["score"] # 1.0 if tools called correctly, else <1.0 + >>> results = ToolCorrectness(dataset=my_data) + >>> print(results[0]["score"]) # 1.0 if tools match, <1.0 if not Risks & Limitations: - Works best if dataset includes high-quality tool call signals & references. - Comparison logic may be limited for atypically formatted tool call traces. + - Designed for table-formatted datasets with expected/actual tool call annotations. + - Requires unambiguous tool call information for accurate evaluation. + - May not fully support edge case tool calling formats or custom tracing logic. + """ - # Validate required columns exist in dataset + from validmind.scorers.llm.deepeval import _convert_to_tool_call_list + missing_columns: List[str] = [] if input_column not in dataset._df.columns: missing_columns.append(input_column) - if expected_tools_column not in dataset._df.columns: - missing_columns.append(expected_tools_column) - + if expected_tools_called_column not in dataset._df.columns: + missing_columns.append(expected_tools_called_column) + if actual_tools_called_column not in dataset._df.columns: + missing_columns.append(actual_tools_called_column) if missing_columns: raise ValueError( - f"Required columns {missing_columns} not found in dataset. " + f"ToolCorrectness requires columns {missing_columns}. 
" f"Available columns: {dataset._df.columns.tolist()}" ) - # Import helper functions to avoid circular import - from validmind.scorers.llm.deepeval import ( - _convert_to_tool_call_list, - extract_tool_calls_from_agent_output, - ) - - _, model = get_client_and_model() - - metric = ToolCorrectnessMetric( - threshold=threshold, - model=model, - ) - + _, llm_model = get_client_and_model() results: List[Dict[str, Any]] = [] - for _, row in dataset._df.iterrows(): - input_value = row[input_column] - expected_tools_value = row.get(expected_tools_column, []) - - # Extract tools called - if tools_called_column in dataset._df.columns: - tools_called_value = row.get(tools_called_column, []) - else: - agent_output = row.get(agent_output_column, {}) - tools_called_value = extract_tool_calls_from_agent_output(agent_output) + for _, row in dataset._df.iterrows(): + expected_tools_value = row.get(expected_tools_called_column, []) expected_tools_list = _convert_to_tool_call_list(expected_tools_value) - tools_called_list = _convert_to_tool_call_list(tools_called_value) + actual_tools_value = row.get(actual_tools_called_column, []) + actual_tools_list = _convert_to_tool_call_list(actual_tools_value) - actual_output_value = row.get(actual_output_column, "") + metric = ToolCorrectnessMetric( + threshold=threshold, + model=llm_model, + ) test_case = LLMTestCase( - input=input_value, + input=row[input_column], expected_tools=expected_tools_list, - tools_called=tools_called_list, - actual_output=actual_output_value, - _trace_dict=row.get(agent_output_column, {}), + tools_called=actual_tools_list, ) result = evaluate(test_cases=[test_case], metrics=[metric]) @@ -129,5 +113,4 @@ def ToolCorrectness( score = metric_data.score reason = getattr(metric_data, "reason", "No reason provided") results.append({"score": score, "reason": reason}) - return results diff --git a/validmind/tests/data_validation/WOEBinPlots.py b/validmind/tests/data_validation/WOEBinPlots.py index 4faaeb5ff..d52c8de85 100644 --- a/validmind/tests/data_validation/WOEBinPlots.py +++ b/validmind/tests/data_validation/WOEBinPlots.py @@ -15,14 +15,12 @@ try: import scorecardpy as sc except ImportError as e: - if "scorecardpy" in str(e): - raise MissingDependencyError( - "Missing required package `scorecardpy` for WOEBinPlots. " - "Please run `pip install validmind[credit_risk]` to use these tests", - required_dependencies=["scorecardpy"], - extra="credit_risk", - ) from e - raise e + raise MissingDependencyError( + "Missing required package `scorecardpy` for WOEBinPlots. " + "Please run `pip install validmind[credit_risk]` to use these tests", + required_dependencies=["scorecardpy"], + extra="credit_risk", + ) from e from plotly.subplots import make_subplots from validmind import RawData, tags, tasks diff --git a/validmind/tests/data_validation/WOEBinTable.py b/validmind/tests/data_validation/WOEBinTable.py index dc85fd0b3..d2038eab8 100644 --- a/validmind/tests/data_validation/WOEBinTable.py +++ b/validmind/tests/data_validation/WOEBinTable.py @@ -13,14 +13,12 @@ try: import scorecardpy as sc except ImportError as e: - if "scorecardpy" in str(e): - raise MissingDependencyError( - "Missing required package `scorecardpy` for WOEBinTable. " - "Please run `pip install validmind[credit_risk]` to use these tests", - required_dependencies=["scorecardpy"], - extra="credit_risk", - ) from e - raise e + raise MissingDependencyError( + "Missing required package `scorecardpy` for WOEBinTable. 
" + "Please run `pip install validmind[credit_risk]` to use these tests", + required_dependencies=["scorecardpy"], + extra="credit_risk", + ) from e @tags("tabular_data", "categorical_data")