
Add trace-based evaluation in agent evaluation metrics to get correct score #478

Merged
AnilSorathiya merged 9 commits into main from
anilsorathiya/sc-14625/add-trace-based-evaluation-in-agent-evaluation
Feb 19, 2026

Add trace-based evaluation in agent evaluation metrics to get correct score#478
AnilSorathiya merged 9 commits intomainfrom
anilsorathiya/sc-14625/add-trace-based-evaluation-in-agent-evaluation

Conversation

@AnilSorathiya (Contributor) commented Feb 13, 2026

What and why?

Previously, agent evaluation metrics (PlanQuality, PlanAdherence, TaskCompletion) relied on pre-computed columns in the dataset. DeepEval's trace-based metrics need real trace data captured during agent execution to return correct scores.

This PR adds trace-based evaluation: when you pass a model (agent with predict_fn), the agent runs inside DeepEval's evals_iterator so the metric receives live trace data. Before: metrics used static columns. After: metrics run the agent per row and evaluate from actual traces for more accurate scores.
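Conceptually, the new per-row flow looks like the sketch below. The rows, predict_fn, and metric class are illustrative stand-ins for the ValidMind/DeepEval objects, not their real APIs:

```python
# Simplified sketch of the trace-based path: run the agent per row and
# score from the live trace, instead of reading pre-computed columns.
rows = [{"input": "What is my balance?"}, {"input": "Block my card"}]

def predict_fn(row):
    # Stand-in for the agent: returns an output plus the trace
    # captured while it ran (tool calls, spans, etc.).
    return {"output": f"handled: {row['input']}", "trace": [{"tool": "lookup"}]}

class PlanQualityStub:
    """Stand-in metric: scores from live trace data, not static columns."""
    def measure(self, output, trace):
        return 1.0 if trace else 0.0

metric = PlanQualityStub()
scores = []
for row in rows:                 # one agent run per dataset row
    result = predict_fn(row)     # agent executes, trace is captured
    scores.append(metric.measure(result["output"], result["trace"]))

print(scores)  # [1.0, 1.0]
```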

Changes include:

  • PlanQuality and PlanAdherence now require a model with predict_fn and run the agent per row inside evals_iterator.
  • TaskCompletion supports the same trace-based path when model is provided.
  • ToolCorrectness and ArgumentCorrectness use simpler parameter names (actual_tools_called_column, expected_tools_called_column).
  • Banking tools and agent are instrumented with @observe from DeepEval for trace capture.
  • The agent uses update_current_span to feed trace data to DeepEval metrics that need it.

A library dependency has also been pinned to fix the following failure in GitHub Actions:
https://github.com/validmind/validmind-library/actions/runs/21994530314/job/63771440702

======================================================================
ERROR: test_WOEBinPlots (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_WOEBinPlots
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/unittest/loader.py", line 154, in loadTestsFromName
    module = __import__(module_name)
  File "/home/runner/work/validmind-library/validmind-library/tests/unit_tests/data_validation/test_WOEBinPlots.py", line 6, in <module>
    from validmind.tests.data_validation.WOEBinPlots import WOEBinPlots
  File "/home/runner/work/validmind-library/validmind-library/validmind/tests/data_validation/WOEBinPlots.py", line 25, in <module>
    raise e
  File "/home/runner/work/validmind-library/validmind-library/validmind/tests/data_validation/WOEBinPlots.py", line 16, in <module>
    import scorecardpy as sc
  File "/opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/scorecardpy/__init__.py", line 3, in <module>
    from scorecardpy.germancredit import germancredit
  File "/opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/scorecardpy/germancredit.py", line 5, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'
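The root cause is that pkg_resources ships with setuptools, which is no longer guaranteed to be present in newer Python environments. Guarding the optional import converts the bare traceback above into an actionable error; a minimal sketch of that pattern follows (MissingDependencyError exists in the library, but this signature, message, and the `credit-risk` extra name are assumptions):

```python
class MissingDependencyError(ImportError):
    """Raised when an optional dependency needed by a feature is absent."""

def require(module_name, extra):
    # Import an optional dependency, turning a bare ImportError into a
    # clear message telling the user which extra to install.
    try:
        return __import__(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install validmind[{extra}]"
        ) from exc

json_mod = require("json", "core")  # present: returns the module
try:
    require("scorecardpy_missing_stub", "credit-risk")
except MissingDependencyError as err:
    print(err)
```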

How to test

  1. Open and run notebooks/code_samples/agents/document_agentic_ai.ipynb.
  2. Ensure the agent runs and metrics execute without errors.
  3. Verify PlanQuality, PlanAdherence, ToolCorrectness, ArgumentCorrectness, and TaskCompletion return scores and reasons.
  4. Confirm the BoxPlot for TaskCompletion scores renders correctly.

What needs special review?

  • Trace-based vs dataset-only path in each metric.
  • Correct use of DeepEval's evals_iterator and update_current_span.
  • Removal of agent_output_column and other column parameters where trace-based flow replaces them.

Dependencies, breaking changes, and deployment notes

Release notes

Suggested label: enhancement or breaking-change


Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

@AnilSorathiya added the labels internal (Not to be externalized in the release notes) and chore (Chore tasks that aren't bugs or new features) on Feb 13, 2026
@juanmleng (Contributor) commented:

Just a quick side note: Faithfulness is coming out pretty low in most cases. This is because the tool_messages contain very concise information, but the agent answers add lots of extra stuff that isn’t in those tool outputs. RAGAS treats that extra detail as unsupported claims, so the score drops. Might be an interesting question of how to properly evaluate faithfulness in these agentic use cases.

@juanmleng (Contributor) left a review:

Looks great. I just left a suggestion to fix a small issue with PlanQuality and PlanAdherence outputs.

@cachafla (Contributor) left a review:

Approved, but please verify why setuptools is required here.

@github-actions commented:

PR Summary

This PR introduces several major enhancements and bug fixes across the project. Changes include:

  1. In the agent use cases (e.g. in notebooks/use_cases/agents/banking_tools.py), the new tracing decorator (@observe(type="tool")) has been added to various tool functions. This allows for better tracing and observability of tool calls during agent execution.

  2. The DeepEval scoring functions in the LLM scorers (such as ArgumentCorrectness, PlanAdherence, PlanQuality, TaskCompletion, and ToolCorrectness) have been updated to use a trace-based evaluation pattern via an evals_iterator. In the new iteration, each dataset row is processed by running the agent (via the model’s predict_fn) and assigning the score after iterating over a single golden example. This change enhances the integration between DeepEval and the ValidMind framework.

  3. Several column name parameters have been standardized (for example, changing from tools_called_column to actual_tools_called_column and from expected_tools_column to expected_tools_called_column in the ToolCorrectness scorer) to improve clarity and consistency in the evaluation of tool calls.

  4. The dependency on the credit risk package has been tightened by locking the version of scorecardpy (now using exactly "scorecardpy==0.1.9.6") in both the pyproject.toml and poetry.lock. Corresponding unit tests in the data validation module now skip execution when scorecardpy is not installed, using a skip-if condition.

  5. In modules such as validmind/datasets/credit_risk/lending_club.py and the test files (e.g. WOEBinPlots.py and WOEBinTable.py), error handling has been refined so that a MissingDependencyError is raised with a clear message if scorecardpy is missing.

Overall, these changes strengthen the integration with DeepEval, improve agent traceability and scoring reliability, and ensure that dependency versions remain consistent across environments.

Test Suggestions

  • Run all unit tests (especially the newly added or modified DeepEval scorer tests) to confirm that scores and reasons are computed correctly.
  • Execute integration tests for agent tools (banking_tools.py) to verify that the @observe decorator and tracing functionalities are logging as expected.
  • Perform end-to-end runs of the credit risk demos and documentation notebooks to ensure that the fixed dependency versions (scorecardpy==0.1.9.6) work seamlessly.
  • Test the behavior when the required dependency (scorecardpy) is not installed to ensure that tests are properly skipped and that MissingDependencyErrors are raised with the correct messages.
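The last suggestion can be exercised with the stdlib unittest skip pattern; the sketch below is illustrative (the test body and `woebin` attribute check are assumptions, not the repo's actual test):

```python
import importlib.util
import unittest

# Skip the whole TestCase when the optional dependency is absent,
# instead of failing at import time as in the CI log in the PR body.
HAVE_SCORECARDPY = importlib.util.find_spec("scorecardpy") is not None

@unittest.skipIf(not HAVE_SCORECARDPY, "scorecardpy is not installed")
class TestWOEBinPlots(unittest.TestCase):
    def test_import(self):
        import scorecardpy  # only reached when the package is present
        self.assertTrue(hasattr(scorecardpy, "woebin"))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestWOEBinPlots)
result = unittest.TestResult()
suite.run(result)
print("skipped" if result.skipped else "ran")
```

Either way the suite reports success: the test runs when scorecardpy is installed and is cleanly skipped when it is not.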

@AnilSorathiya AnilSorathiya merged commit eb08648 into main Feb 19, 2026
17 checks passed
@AnilSorathiya AnilSorathiya deleted the anilsorathiya/sc-14625/add-trace-based-evaluation-in-agent-evaluation branch February 19, 2026 15:13