Add trace-based evaluation in agent evaluation metrics to get correct score #478
Just a quick side note: Faithfulness is coming out pretty low in most cases. This is because the
juanmleng
left a comment
Looks great. I just left a suggestion to fix a small issue with PlanQuality and PlanAdherence outputs.
cachafla
left a comment
Approved, but please verify why setuptools is required here?
PR Summary
This PR introduces several major enhancements and bug fixes across the project. Changes include:
Overall, these changes strengthen the integration with DeepEval, improve agent traceability and scoring reliability, and ensure that dependency versions remain consistent across environments.
Test Suggestions
What and why?
Previously, agent evaluation metrics (PlanQuality, PlanAdherence, TaskCompletion) relied on pre-computed columns in the dataset. DeepEval's trace-based metrics need real trace data captured during agent execution to return correct scores.
This PR adds trace-based evaluation: when you pass a `model` (an agent with `predict_fn`), the agent runs inside DeepEval's evals_iterator so the metric receives live trace data. Before: metrics used static columns. After: metrics run the agent per row and evaluate from actual traces for more accurate scores.
Changes include:
- … `model` with `predict_fn` and run the agent per row inside evals_iterator.
- … `model` is provided.
- … `actual_tools_called_column`, `expected_tools_called_column`).
- … `@observe` from DeepEval for trace capture.
- … `update_current_span` to feed trace data to DeepEval metrics that need it.

A library dependency has also been fixed to resolve the failure in this Actions run:
https://github.com/validmind/validmind-library/actions/runs/21994530314/job/63771440702
How to test
- … `notebooks/code_samples/agents/document_agentic_ai.ipynb`.

What needs special review?
- … `evals_iterator` and `update_current_span`.
- … `agent_output_column` and other column parameters where trace-based flow replaces them.

Dependencies, breaking changes, and deployment notes
Release notes
Suggested label: `enhancement` or `breaking-change`
Checklist