Summary
When the Anthropic Messages API returns content blocks from server-side tools (server_tool_use, web_search_tool_result, code_execution_tool_result), the Braintrust wrapper logs them as part of the flat message.content array in the LLM span's output. No child TOOL spans are created for these server-executed tool invocations.
The usage-level metrics for server tools (server_tool_use_web_search_requests, server_tool_use_web_fetch_requests, server_tool_use_code_execution_requests) are properly captured — this gap is specifically about the lack of span-level visibility into individual server-side tool executions and their results.
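To make the distinction concrete, here is a minimal sketch of how nested usage counters could map to the flat metric names listed above. The helper name `flatten_server_tool_usage` and the plain-dict input are assumptions for illustration; this is not the body of extract_anthropic_usage().

```python
from typing import Any, Dict, Mapping


def flatten_server_tool_usage(usage: Mapping[str, Any]) -> Dict[str, int]:
    """Hypothetical helper: flatten usage["server_tool_use"] into prefixed metrics."""
    server_tool_use = usage.get("server_tool_use") or {}
    metrics: Dict[str, int] = {}
    for key, value in server_tool_use.items():
        # Keep only integer counters; bool is excluded because it subclasses int.
        if isinstance(value, int) and not isinstance(value, bool):
            metrics[f"server_tool_use_{key}"] = value
    return metrics


# Illustrative usage payload (values are made up).
example_usage = {
    "input_tokens": 120,
    "output_tokens": 45,
    "server_tool_use": {"web_search_requests": 2, "code_execution_requests": 1},
}
flat_metrics = flatten_server_tool_usage(example_usage)
```

Counters like these say how many server-side tool requests were made, but not what each request's query or result was. That per-invocation detail is what this issue is about.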
What is missing
The _log_message_to_span() function (py/src/braintrust/integrations/anthropic/tracing.py, line 490) logs the entire message.content array as a single output blob:
    output = {
        "role": getattr(message, "role", None),
        "content": getattr(message, "content", None),  # All content blocks, flat
        ...
    }
    span.log(output=output, metrics=metrics, metadata=metadata)
When Anthropic server-side tools are used, message.content contains interleaved blocks like:
| Content block type | Description | Child TOOL span created? |
| --- | --- | --- |
| server_tool_use | Server-side tool invocation (search query, code to execute) | No |
| web_search_tool_result | Web search results with URLs, titles, encrypted content | No |
| code_execution_tool_result | Code execution output (stdout, stderr, files) | No |
| text (with citations) | Text with citation references to search results | No |
Each server_tool_use + *_tool_result pair represents a complete server-side tool invocation that could be a child TOOL span with its own input (the query/code), output (the results), and metadata (tool_use_id, error codes).
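The pairing described above can be sketched as follows. This is an illustration under stated assumptions, not the wrapper's code: the helper name `pair_server_tool_blocks` is hypothetical, content blocks are modeled as plain dicts, and the sketch assumes a server_tool_use block carries `id`, `name`, and `input`, a result block carries `tool_use_id` and `content`, and an error result exposes `error_code` inside `content`.

```python
from typing import Any, Dict, List


def pair_server_tool_blocks(content: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: match each server_tool_use block to its *_tool_result block."""
    invocations: Dict[str, Dict[str, Any]] = {}
    for block in content:
        block_type = block.get("type") or ""
        if block_type == "server_tool_use":
            # Start a new invocation keyed by the tool use id.
            invocations[block["id"]] = {
                "tool_use_id": block["id"],
                "name": block.get("name"),
                "input": block.get("input"),
                "output": None,
                "error_code": None,
            }
        elif block_type.endswith("_tool_result") and block.get("tool_use_id") in invocations:
            # Attach the result to the invocation it refers to.
            invocation = invocations[block["tool_use_id"]]
            result = block.get("content")
            invocation["output"] = result
            if isinstance(result, dict):
                # Assumption: an error result is a dict with an error_code field.
                invocation["error_code"] = result.get("error_code")
    return list(invocations.values())


# Illustrative message content (values are made up).
example_content = [
    {"type": "text", "text": "Let me look that up."},
    {"type": "server_tool_use", "id": "srvtoolu_1", "name": "web_search",
     "input": {"query": "braintrust tracing"}},
    {"type": "web_search_tool_result", "tool_use_id": "srvtoolu_1",
     "content": [{"type": "web_search_result", "url": "https://example.com", "title": "Example"}]},
    {"type": "text", "text": "Here is what I found."},
]
pairs = pair_server_tool_blocks(example_content)
```

Text blocks are skipped, so `pairs` contains one entry per server-side invocation, in the order the invocations appear in the message.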
Comparison with other integrations in this repo
The Google GenAI integration (py/src/braintrust/integrations/google_genai/tracing.py) creates dedicated SpanTypeAttribute.TOOL child spans for equivalent server-side tool outputs:
- code_execution_call / code_execution_result
- file_search_call / file_search_result
- url_context_call / url_context_result
- mcp_server_tool_call / mcp_server_tool_result
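For the Anthropic wrapper, an analogous decomposition might look roughly like the sketch below. Everything here is an assumption for illustration: `RecordingSpan` is a stand-in that only records what it is given (it is not the Braintrust span API), the string "tool" stands in for SpanTypeAttribute.TOOL, and `log_server_tool_spans` is a hypothetical helper that takes already-paired invocations as plain dicts.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RecordingSpan:
    """Stand-in for a tracing span; it only records names, types, and logged fields."""
    name: str
    span_type: str
    logged: Dict[str, Any] = field(default_factory=dict)
    children: List["RecordingSpan"] = field(default_factory=list)

    def start_child(self, name: str, span_type: str) -> "RecordingSpan":
        child = RecordingSpan(name=name, span_type=span_type)
        self.children.append(child)
        return child

    def log(self, **fields: Any) -> None:
        self.logged.update(fields)


def log_server_tool_spans(llm_span: RecordingSpan, invocations: List[Dict[str, Any]]) -> None:
    """Hypothetical helper: create one child span of type "tool" per invocation."""
    for invocation in invocations:
        child = llm_span.start_child(name=invocation["name"], span_type="tool")
        child.log(
            input=invocation["input"],
            output=invocation["output"],
            metadata={"tool_use_id": invocation["tool_use_id"]},
        )


# Illustrative parent span and one paired invocation (values are made up).
llm_span = RecordingSpan(name="llm_call", span_type="llm")
log_server_tool_spans(
    llm_span,
    [{
        "tool_use_id": "srvtoolu_1",
        "name": "web_search",
        "input": {"query": "braintrust tracing"},
        "output": [{"url": "https://example.com", "title": "Example"}],
    }],
)
```

The parent LLM span keeps its flat output, and each server-side invocation additionally becomes a child span with its own input, output, and tool_use_id metadata.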
Test coverage
The unit tests in test_anthropic.py validate that server_tool_use metrics are correctly extracted from the usage object (lines 170-282). However, there are no end-to-end tests (or VCR cassettes) that exercise a full server-side tool invocation and verify span-level output. The cassettes/ directory contains no recordings with server_tool_use or web_search_tool_result content blocks.
Braintrust docs status
supported (partial) — The Anthropic integration page documents server_tool_use_* metrics. It does not mention tool span decomposition for server-side tool content blocks.
Local files inspected
- py/src/braintrust/integrations/anthropic/tracing.py:
  - _log_message_to_span() (line 490): logs the entire message.content as flat output, with no content block type inspection
  - Streaming path uses accumulate_event(), then the same _log_message_to_span(): same flat treatment
- py/src/braintrust/integrations/anthropic/_utils.py:
  - extract_anthropic_usage() (line 56): properly captures server_tool_use metrics (web_search_requests, web_fetch_requests, code_execution_requests)
- py/src/braintrust/integrations/anthropic/test_anthropic.py: tests validate metrics extraction only; no end-to-end cassettes for server-side tools
- py/src/braintrust/integrations/anthropic/cassettes/: no cassettes with server-side tool content blocks
- py/src/braintrust/integrations/google_genai/tracing.py (lines 771-838): creates SpanTypeAttribute.TOOL child spans for equivalent tool outputs (for comparison)
Upstream sources
- web_search_20250305, web_search_20260209 (with dynamic filtering)
- server_tool_use (query) + web_search_tool_result (results with URLs, titles, encrypted_content)
- text blocks with citations array containing web_search_result_location entries
- code_execution_20250825
- server_tool_use (code) + code_execution_tool_result (stdout, stderr, files)