[BOT ISSUE] Anthropic: Server-side tool content blocks (web search, code execution) not decomposed into tool spans #250

@braintrust-bot


Summary

When the Anthropic Messages API returns content blocks from server-side tools (server_tool_use, web_search_tool_result, code_execution_tool_result), the Braintrust wrapper logs them as part of the flat message.content array in the LLM span's output. No child TOOL spans are created for these server-executed tool invocations.

The usage-level metrics for server tools (server_tool_use_web_search_requests, server_tool_use_web_fetch_requests, server_tool_use_code_execution_requests) are properly captured — this gap is specifically about the lack of span-level visibility into individual server-side tool executions and their results.
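For context, the usage-level capture described above amounts to flattening the nested server_tool_use counters into prefixed metric names. The following is a minimal sketch of that idea, not the actual extract_anthropic_usage() implementation; the function name flatten_server_tool_usage and the plain-dict usage payload are assumptions made for illustration.

```python
from typing import Any, Dict


def flatten_server_tool_usage(usage: Dict[str, Any]) -> Dict[str, int]:
    """Flatten usage["server_tool_use"] into server_tool_use_<counter> metrics.

    Hypothetical helper: the real integration may handle SDK objects,
    missing fields, and other counters differently.
    """
    server_tool_use = usage.get("server_tool_use") or {}
    metrics: Dict[str, int] = {}
    for key, value in server_tool_use.items():
        # Keep only integer counters; skip None, nested, or boolean values.
        if isinstance(value, int) and not isinstance(value, bool):
            metrics[f"server_tool_use_{key}"] = value
    return metrics


# Illustrative payload; the numbers are made up.
example_usage = {
    "input_tokens": 120,
    "output_tokens": 45,
    "server_tool_use": {"web_search_requests": 2, "web_fetch_requests": 0},
}
example_metrics = flatten_server_tool_usage(example_usage)
```

These counters say how many server-side calls happened, but not what each call asked for or returned, which is the gap this issue describes.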

What is missing

The _log_message_to_span() function (py/src/braintrust/integrations/anthropic/tracing.py, line 490) logs the entire message.content array as a single output blob:

output = {
    "role": getattr(message, "role", None),
    "content": getattr(message, "content", None),  # All content blocks, flat
    ...
}
span.log(output=output, metrics=metrics, metadata=metadata)

When Anthropic server-side tools are used, message.content contains interleaved blocks like:

| Content block type | Description | Child TOOL span created? |
| --- | --- | --- |
| server_tool_use | Server-side tool invocation (search query, code to execute) | No |
| web_search_tool_result | Web search results with URLs, titles, encrypted content | No |
| code_execution_tool_result | Code execution output (stdout, stderr, files) | No |
| text (with citations) | Text with citation references to search results | No |

Each server_tool_use + *_tool_result pair represents a complete server-side tool invocation that could be a child TOOL span with its own input (the query/code), output (the results), and metadata (tool_use_id, error codes).
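The pairing described above could be sketched as follows. This is only an illustration of matching blocks by tool_use_id, not a proposed patch: pair_server_tool_blocks and _get are hypothetical names, and the sketch assumes that every server-side result block has a type ending in "_tool_result" and carries a tool_use_id matching the id of its server_tool_use block.

```python
from typing import Any, Dict, List


def _get(block: Any, key: str, default: Any = None) -> Any:
    """Read a field from either a plain dict or an SDK-style object."""
    if isinstance(block, dict):
        return block.get(key, default)
    return getattr(block, key, default)


def pair_server_tool_blocks(content: List[Any]) -> List[Dict[str, Any]]:
    """Group each server_tool_use block with its matching *_tool_result block.

    Returns one record per invocation, in the order the invocations appear.
    A record whose result never arrives keeps output=None.
    """
    records: Dict[str, Dict[str, Any]] = {}
    order: List[str] = []
    for block in content or []:
        block_type = _get(block, "type")
        if block_type == "server_tool_use":
            tool_id = _get(block, "id")
            records[tool_id] = {
                "tool_use_id": tool_id,
                "name": _get(block, "name"),
                "input": _get(block, "input"),
                "output": None,
                "result_type": None,
            }
            order.append(tool_id)
        elif isinstance(block_type, str) and block_type.endswith("_tool_result"):
            tool_id = _get(block, "tool_use_id")
            if tool_id in records:
                records[tool_id]["output"] = _get(block, "content")
                records[tool_id]["result_type"] = block_type
    return [records[tool_id] for tool_id in order]


# Illustrative content array; ids and values are invented.
example_content = [
    {"type": "text", "text": "Searching."},
    {"type": "server_tool_use", "id": "srvtoolu_1", "name": "web_search",
     "input": {"query": "example"}},
    {"type": "web_search_tool_result", "tool_use_id": "srvtoolu_1",
     "content": [{"type": "web_search_result", "url": "https://example.com"}]},
    {"type": "text", "text": "Done."},
]
example_pairs = pair_server_tool_blocks(example_content)
```

Each returned record carries the three pieces a child TOOL span would need: input, output, and tool_use_id as metadata. Text blocks are left untouched, so the LLM span's output could still include the full content array.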

Comparison with other integrations in this repo

The Google GenAI integration (py/src/braintrust/integrations/google_genai/tracing.py) creates dedicated SpanTypeAttribute.TOOL child spans for equivalent server-side tool outputs:

  • code_execution_call / code_execution_result
  • file_search_call / file_search_result
  • url_context_call / url_context_result
  • mcp_server_tool_call / mcp_server_tool_result

Test coverage

The unit tests in test_anthropic.py validate that server_tool_use metrics are correctly extracted from the usage object (lines 170-282). However, there are no end-to-end tests (or VCR cassettes) that exercise a full server-side tool invocation and verify span-level output. The cassettes/ directory contains no recordings with server_tool_use or web_search_tool_result content blocks.
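Until a recorded cassette exists, a span-level test could start from a hand-built message. The sketch below is a synthetic fixture under stated assumptions: SimpleNamespace objects stand in for SDK content blocks, the id "srvtoolu_test" and the helper names are invented, and nothing here is taken from a real API response.

```python
from types import SimpleNamespace
from typing import List


def make_server_tool_message() -> SimpleNamespace:
    """Build a fake assistant message containing one web search invocation."""
    return SimpleNamespace(
        role="assistant",
        content=[
            SimpleNamespace(type="text", text="Let me search."),
            SimpleNamespace(
                type="server_tool_use",
                id="srvtoolu_test",
                name="web_search",
                input={"query": "braintrust tracing"},
            ),
            SimpleNamespace(
                type="web_search_tool_result",
                tool_use_id="srvtoolu_test",
                content=[],
            ),
            SimpleNamespace(type="text", text="Here is what I found."),
        ],
    )


def expected_tool_span_names(message: SimpleNamespace) -> List[str]:
    """Names of the child TOOL spans a decomposing wrapper would be expected to emit."""
    return [
        block.name
        for block in message.content
        if getattr(block, "type", None) == "server_tool_use"
    ]
```

A test could pass such a message through the logging path and compare the emitted child spans against expected_tool_span_names(message). A real cassette would still be needed to confirm the shape of actual responses.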

Braintrust docs status

supported (partial) — The Anthropic integration page documents server_tool_use_* metrics. It does not mention tool span decomposition for server-side tool content blocks.

Upstream sources

Local files inspected

  • py/src/braintrust/integrations/anthropic/tracing.py:
    • _log_message_to_span() (line 490) — logs entire message.content as flat output, no content block type inspection
    • Streaming path uses accumulate_event() then same _log_message_to_span() — same flat treatment
  • py/src/braintrust/integrations/anthropic/_utils.py:
    • extract_anthropic_usage() (line 56) — properly captures server_tool_use metrics (web_search_requests, web_fetch_requests, code_execution_requests)
  • py/src/braintrust/integrations/anthropic/test_anthropic.py — tests validate metrics extraction only; no end-to-end cassettes for server-side tools
  • py/src/braintrust/integrations/anthropic/cassettes/ — no cassettes with server-side tool content blocks
  • py/src/braintrust/integrations/google_genai/tracing.py (lines 771-838) — creates SpanTypeAttribute.TOOL child spans for equivalent tool outputs (for comparison)
