[BOT ISSUE] Anthropic: Server-side tool content blocks (web search, code execution) not decomposed into tool spans

## Summary

When the Anthropic Messages API returns content blocks from server-side tools (`server_tool_use`, `web_search_tool_result`, `code_execution_tool_result`), the Braintrust wrapper logs them as part of the flat `message.content` array in the LLM span's output. No child `TOOL` spans are created for these server-executed tool invocations.

The **usage-level metrics** for server tools (`server_tool_use_web_search_requests`, `server_tool_use_web_fetch_requests`, `server_tool_use_code_execution_requests`) are properly captured — this gap is specifically about the lack of span-level visibility into individual server-side tool executions and their results.

## What is missing

The `_log_message_to_span()` function (`py/src/braintrust/integrations/anthropic/tracing.py`, line 490) logs the entire `message.content` array as a single output blob:

```python
output = {
    "role": getattr(message, "role", None),
    "content": getattr(message, "content", None),  # All content blocks, flat
    ...
}
span.log(output=output, metrics=metrics, metadata=metadata)
```

When Anthropic server-side tools are used, `message.content` contains interleaved blocks like:

| Content block type | Description | Child TOOL span created? |
|---|---|---|
| `server_tool_use` | Server-side tool invocation (search query, code to execute) | No |
| `web_search_tool_result` | Web search results with URLs, titles, encrypted content | No |
| `code_execution_tool_result` | Code execution output (stdout, stderr, files) | No |
| `text` (with `citations`) | Text with citation references to search results | No |

Each `server_tool_use` + `*_tool_result` pair represents a complete server-side tool invocation that could be a child `TOOL` span with its own input (the query/code), output (the results), and metadata (tool_use_id, error codes).

### Comparison with other integrations in this repo

The Google GenAI integration (`py/src/braintrust/integrations/google_genai/tracing.py`) creates dedicated `SpanTypeAttribute.TOOL` child spans for equivalent server-side tool outputs:
- `code_execution_call` / `code_execution_result`
- `file_search_call` / `file_search_result`
- `url_context_call` / `url_context_result`
- `mcp_server_tool_call` / `mcp_server_tool_result`

### Test coverage

The unit tests in `test_anthropic.py` validate that server_tool_use **metrics** are correctly extracted from the usage object (lines 170-282). However, there are no end-to-end tests (or VCR cassettes) that exercise a full server-side tool invocation and verify span-level output. The `cassettes/` directory contains no recordings with `server_tool_use` or `web_search_tool_result` content blocks.

## Braintrust docs status

**supported** (partial) — The [Anthropic integration page](https://www.braintrust.dev/docs/integrations/ai-providers/anthropic) documents server_tool_use_* metrics. It does not mention tool span decomposition for server-side tool content blocks.

## Upstream sources

- Anthropic web search tool documentation: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/web-search-tool
  - Tool types: `web_search_20250305`, `web_search_20260209` (with dynamic filtering)
  - Response content blocks: `server_tool_use` (query) + `web_search_tool_result` (results with URLs, titles, encrypted_content)
  - Citations: `text` blocks with `citations` array containing `web_search_result_location` entries
- Anthropic code execution tool documentation: https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/code-execution-tool
  - Tool type: `code_execution_20250825`
  - Response content blocks: `server_tool_use` (code) + `code_execution_tool_result` (stdout, stderr, files)
- Both tools are GA and available on all supported Claude models

## Local files inspected

- `py/src/braintrust/integrations/anthropic/tracing.py`:
  - `_log_message_to_span()` (line 490) — logs entire `message.content` as flat output, no content block type inspection
  - Streaming path uses `accumulate_event()` then same `_log_message_to_span()` — same flat treatment
- `py/src/braintrust/integrations/anthropic/_utils.py`:
  - `extract_anthropic_usage()` (line 56) — properly captures `server_tool_use` metrics (web_search_requests, web_fetch_requests, code_execution_requests)
- `py/src/braintrust/integrations/anthropic/test_anthropic.py` — tests validate metrics extraction only; no end-to-end cassettes for server-side tools
- `py/src/braintrust/integrations/anthropic/cassettes/` — no cassettes with server-side tool content blocks
- `py/src/braintrust/integrations/google_genai/tracing.py` (lines 771-838) — creates `SpanTypeAttribute.TOOL` child spans for equivalent tool outputs (for comparison)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BOT ISSUE] Anthropic: Server-side tool content blocks (web search, code execution) not decomposed into tool spans #250

Summary

What is missing

Comparison with other integrations in this repo

Test coverage

Braintrust docs status

Upstream sources

Local files inspected

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Content block type	Description	Child TOOL span created?
`server_tool_use`	Server-side tool invocation (search query, code to execute)	No
`web_search_tool_result`	Web search results with URLs, titles, encrypted content	No
`code_execution_tool_result`	Code execution output (stdout, stderr, files)	No
`text` (with `citations`)	Text with citation references to search results	No

[BOT ISSUE] Anthropic: Server-side tool content blocks (web search, code execution) not decomposed into tool spans #250

Description

Summary

What is missing

Comparison with other integrations in this repo

Test coverage

Braintrust docs status

Upstream sources

Local files inspected

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions