
Add trace-based evaluation in agent evaluation metrics to get correct score #478

Merged
AnilSorathiya merged 9 commits into main from
anilsorathiya/sc-14625/add-trace-based-evaluation-in-agent-evaluation
Feb 19, 2026

Add trace-based evaluation in agent evaluation metrics to get correct score#478
AnilSorathiya merged 9 commits intomainfrom
anilsorathiya/sc-14625/add-trace-based-evaluation-in-agent-evaluation

Conversation

@AnilSorathiya (Contributor) commented Feb 13, 2026

What and why?

Previously, agent evaluation metrics (PlanQuality, PlanAdherence, TaskCompletion) relied on pre-computed columns in the dataset. DeepEval's trace-based metrics need real trace data captured during agent execution to return correct scores.

This PR adds trace-based evaluation: when you pass a model (agent with predict_fn), the agent runs inside DeepEval's evals_iterator so the metric receives live trace data. Before: metrics used static columns. After: metrics run the agent per row and evaluate from actual traces for more accurate scores.
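Conceptually, the new per-row flow looks like the sketch below. The rows, predict_fn, and metric class are illustrative stand-ins for the ValidMind/DeepEval objects, not their real APIs:

```python
# Simplified sketch of the trace-based path: run the agent per row and
# score from the live trace, instead of reading pre-computed columns.
rows = [{"input": "What is my balance?"}, {"input": "Block my card"}]

def predict_fn(row):
    # Stand-in for the agent: returns an output plus the trace
    # captured while it ran (tool calls, spans, etc.).
    return {"output": f"handled: {row['input']}", "trace": [{"tool": "lookup"}]}

class PlanQualityStub:
    """Stand-in metric: scores from live trace data, not static columns."""
    def measure(self, output, trace):
        return 1.0 if trace else 0.0

metric = PlanQualityStub()
scores = []
for row in rows:                 # one agent run per dataset row
    result = predict_fn(row)     # agent executes, trace is captured
    scores.append(metric.measure(result["output"], result["trace"]))

print(scores)  # [1.0, 1.0]
```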

Changes include:

  • PlanQuality and PlanAdherence now require a model with predict_fn and run the agent per row inside evals_iterator.
  • TaskCompletion supports the same trace-based path when model is provided.
  • ToolCorrectness and ArgumentCorrectness use simpler parameter names (actual_tools_called_column, expected_tools_called_column).
  • Banking tools and agent are instrumented with @observe from DeepEval for trace capture.
  • The agent uses update_current_span to feed trace data to DeepEval metrics that need it.

A library dependency has also been pinned to fix the following failure in GitHub Actions:
https://github.com/validmind/validmind-library/actions/runs/21994530314/job/63771440702

======================================================================
ERROR: test_WOEBinPlots (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_WOEBinPlots
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/unittest/loader.py", line 154, in loadTestsFromName
    module = __import__(module_name)
  File "/home/runner/work/validmind-library/validmind-library/tests/unit_tests/data_validation/test_WOEBinPlots.py", line 6, in <module>
    from validmind.tests.data_validation.WOEBinPlots import WOEBinPlots
  File "/home/runner/work/validmind-library/validmind-library/validmind/tests/data_validation/WOEBinPlots.py", line 25, in <module>
    raise e
  File "/home/runner/work/validmind-library/validmind-library/validmind/tests/data_validation/WOEBinPlots.py", line 16, in <module>
    import scorecardpy as sc
  File "/opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/scorecardpy/__init__.py", line 3, in <module>
    from scorecardpy.germancredit import germancredit
  File "/opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/scorecardpy/germancredit.py", line 5, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'
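The root cause is that pkg_resources ships with setuptools, which is no longer guaranteed to be present in newer Python environments. Guarding the optional import converts the bare traceback above into an actionable error; a minimal sketch of that pattern follows (MissingDependencyError exists in the library, but this signature, message, and the `credit-risk` extra name are assumptions):

```python
class MissingDependencyError(ImportError):
    """Raised when an optional dependency needed by a feature is absent."""

def require(module_name, extra):
    # Import an optional dependency, turning a bare ImportError into a
    # clear message telling the user which extra to install.
    try:
        return __import__(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install validmind[{extra}]"
        ) from exc

json_mod = require("json", "core")  # present: returns the module
try:
    require("scorecardpy_missing_stub", "credit-risk")
except MissingDependencyError as err:
    print(err)
```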

How to test

  1. Open and run notebooks/code_samples/agents/document_agentic_ai.ipynb.
  2. Ensure the agent runs and metrics execute without errors.
  3. Verify PlanQuality, PlanAdherence, ToolCorrectness, ArgumentCorrectness, and TaskCompletion return scores and reasons.
  4. Confirm the BoxPlot for TaskCompletion scores renders correctly.

What needs special review?

  • Trace-based vs dataset-only path in each metric.
  • Correct use of DeepEval's evals_iterator and update_current_span.
  • Removal of agent_output_column and other column parameters where trace-based flow replaces them.

Dependencies, breaking changes, and deployment notes

Release notes

Suggested label: enhancement or breaking-change


Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@AnilSorathiya added the labels internal (Not to be externalized in the release notes) and chore (Chore tasks that aren't bugs or new features) on Feb 13, 2026
@juanmleng (Contributor) commented:

Just a quick side note: Faithfulness is coming out pretty low in most cases. This is because the tool_messages contain very concise information, but the agent answers add lots of extra stuff that isn’t in those tool outputs. RAGAS treats that extra detail as unsupported claims, so the score drops. Might be an interesting question of how to properly evaluate faithfulness in these agentic use cases.

@juanmleng (Contributor) left a review:

Looks great. I just left a suggestion to fix a small issue with PlanQuality and PlanAdherence outputs.

@cachafla (Contributor) left a review:

Approved, but please verify why setuptools is required here.

@github-actions commented:

PR Summary

This PR introduces several major enhancements and bug fixes across the project. Changes include:

  1. In the agent use cases (e.g. in notebooks/use_cases/agents/banking_tools.py), the new tracing decorator (@observe(type="tool")) has been added to various tool functions. This allows for better tracing and observability of tool calls during agent execution.

  2. The DeepEval scoring functions in the LLM scorers (such as ArgumentCorrectness, PlanAdherence, PlanQuality, TaskCompletion, and ToolCorrectness) have been updated to use a trace-based evaluation pattern via an evals_iterator. In the new iteration, each dataset row is processed by running the agent (via the model’s predict_fn) and assigning the score after iterating over a single golden example. This change enhances the integration between DeepEval and the ValidMind framework.

  3. Several column name parameters have been standardized (for example, changing from tools_called_column to actual_tools_called_column and from expected_tools_column to expected_tools_called_column in the ToolCorrectness scorer) to improve clarity and consistency in the evaluation of tool calls.

  4. The dependency on the credit risk package has been tightened by locking the version of scorecardpy (now using exactly "scorecardpy==0.1.9.6") in both the pyproject.toml and poetry.lock. Corresponding unit tests in the data validation module now skip execution when scorecardpy is not installed, using a skip-if condition.

  5. In modules such as validmind/datasets/credit_risk/lending_club.py and the test files (e.g. WOEBinPlots.py and WOEBinTable.py), error handling has been refined so that a MissingDependencyError is raised with a clear message if scorecardpy is missing.

Overall, these changes strengthen the integration with DeepEval, improve agent traceability and scoring reliability, and ensure that dependency versions remain consistent across environments.

Test Suggestions

  • Run all unit tests (especially the newly added or modified DeepEval scorer tests) to confirm that scores and reasons are computed correctly.
  • Execute integration tests for agent tools (banking_tools.py) to verify that the @observe decorator and tracing functionalities are logging as expected.
  • Perform end-to-end runs of the credit risk demos and documentation notebooks to ensure that the fixed dependency versions (scorecardpy==0.1.9.6) work seamlessly.
  • Test the behavior when the required dependency (scorecardpy) is not installed to ensure that tests are properly skipped and that MissingDependencyErrors are raised with the correct messages.
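The last suggestion can be exercised with the stdlib unittest skip pattern; the sketch below is illustrative (the test body and `woebin` attribute check are assumptions, not the repo's actual test):

```python
import importlib.util
import unittest

# Skip the whole TestCase when the optional dependency is absent,
# instead of failing at import time as in the CI log in the PR body.
HAVE_SCORECARDPY = importlib.util.find_spec("scorecardpy") is not None

@unittest.skipIf(not HAVE_SCORECARDPY, "scorecardpy is not installed")
class TestWOEBinPlots(unittest.TestCase):
    def test_import(self):
        import scorecardpy  # only reached when the package is present
        self.assertTrue(hasattr(scorecardpy, "woebin"))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWOEBinPlots)
result = unittest.TestResult()
suite.run(result)
print("skipped" if result.skipped else "ran")
```

Either way the suite reports success: the test runs when scorecardpy is installed and is cleanly skipped when it is not.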

@AnilSorathiya AnilSorathiya merged commit eb08648 into main Feb 19, 2026
17 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-14625/add-trace-based-evaluation-in-agent-evaluation branch February 19, 2026 15:13