[BOT ISSUE] OpenAI: Responses API built-in tool output items not decomposed into tool spans #249

@braintrust-bot

Summary

When the OpenAI Responses API returns output items from built-in tools (web_search_call, file_search_call, code_interpreter_call, computer_call, image_generation_call, mcp_call), the Braintrust wrapper stores them as opaque entries in the LLM span's output array. No child TOOL spans are created for these server-side tool invocations.

This contrasts with how other integrations in this same repo handle equivalent tool outputs — notably the Google GenAI integration, which creates dedicated SpanTypeAttribute.TOOL child spans for function calls, code execution, file search, URL context, and MCP tool calls.

What is missing

The ResponseWrapper class (py/src/braintrust/integrations/openai/tracing.py) handles all response output items uniformly:

  • Non-streaming (_parse_event_from_result, line 763): stores result["output"] as the span's output field with no item-type inspection.
  • Streaming (_postprocess_streaming_results, line 783): accumulates output items from streaming events into a flat list. Tracks response.output_item.added and content deltas, but does not differentiate by item type.
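As a rough illustration of the missing item-type inspection, the sketch below partitions output items by their "type" field and collects completed items from streaming events. It is not the wrapper's code: split_output_items, completed_items_from_events, and TOOL_ITEM_TYPES are hypothetical names, and items and events are assumed to be plain dicts (for example after converting SDK objects), with finished items arriving on response.output_item.done events.

```python
from typing import Any, Iterable

# Item types named in this issue. Treating exactly this set as "tool items"
# is an assumption of the sketch, not existing wrapper behavior.
TOOL_ITEM_TYPES = frozenset(
    {
        "web_search_call",
        "file_search_call",
        "code_interpreter_call",
        "computer_call",
        "image_generation_call",
        "mcp_call",
        "function_call",
    }
)


def split_output_items(
    output: Iterable[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition output items into (tool_items, other_items) by their "type" field."""
    tool_items: list[dict[str, Any]] = []
    other_items: list[dict[str, Any]] = []
    for item in output:
        if item.get("type") in TOOL_ITEM_TYPES:
            tool_items.append(item)
        else:
            other_items.append(item)
    return tool_items, other_items


def completed_items_from_events(
    events: Iterable[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Collect finished output items from streaming events.

    Assumes each event is a dict with a "type" key and, for output-item
    events, an "item" key holding the item payload.
    """
    return [
        event["item"]
        for event in events
        if event.get("type") == "response.output_item.done" and "item" in event
    ]
```

With a split like this, the non-streaming and streaming paths could share one decomposition step: the streaming path would first reduce its events to completed items, then hand them to the same partitioning function.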

Output item types that should produce child TOOL spans (the six built-in tools listed in the Summary, plus user-defined function_call):

| Output item type | Description | Child span created? |
| --- | --- | --- |
| web_search_call | Server-side web search with results | No |
| file_search_call | Server-side file/vector search with results | No |
| code_interpreter_call | Server-side code execution with code + outputs | No |
| computer_call | Computer use tool invocation | No |
| image_generation_call | Server-side image generation | No |
| mcp_call | MCP tool invocation with arguments + output | No |
| function_call | User-defined function call request | No |
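One possible shape for the decomposition is sketched below. It is only a sketch under stated assumptions: RecordedSpan is a stand-in recorder, not the Braintrust span API, and add_tool_child_spans and TOOL_ITEM_FIELDS are hypothetical names. The per-type field names (action, queries, results, code, outputs, result, name, arguments, output, error) reflect my reading of the OpenAI SDK item models and should be checked against openai/types/responses/ before use.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class RecordedSpan:
    """Stand-in for a tracing span. This is NOT the Braintrust span API."""

    name: str
    type: str
    input: Any = None
    output: Any = None
    metadata: dict[str, Any] = field(default_factory=dict)
    children: list["RecordedSpan"] = field(default_factory=list)


# Hypothetical mapping: item type -> (input field names, output field names).
# The field names are assumptions about the item payloads; verify them.
TOOL_ITEM_FIELDS: dict[str, tuple[tuple[str, ...], tuple[str, ...]]] = {
    "web_search_call": (("action",), ()),
    "file_search_call": (("queries",), ("results",)),
    "code_interpreter_call": (("code",), ("outputs",)),
    "computer_call": (("action",), ()),
    "image_generation_call": ((), ("result",)),
    "mcp_call": (("name", "arguments"), ("output", "error")),
    "function_call": (("name", "arguments"), ()),
}


def _pick(item: dict[str, Any], keys: tuple[str, ...]) -> Optional[dict[str, Any]]:
    """Return the non-null subset of `keys` present in `item`, or None if empty."""
    picked = {k: item[k] for k in keys if item.get(k) is not None}
    return picked or None


def add_tool_child_spans(
    llm_span: RecordedSpan, output: list[dict[str, Any]]
) -> list[RecordedSpan]:
    """Append one child span of type "tool" per recognized tool output item."""
    for item in output:
        fields = TOOL_ITEM_FIELDS.get(item.get("type", ""))
        if fields is None:
            continue  # messages, reasoning items, etc. stay on the LLM span only
        input_keys, output_keys = fields
        llm_span.children.append(
            RecordedSpan(
                name=item.get("name") or item["type"],
                type="tool",
                input=_pick(item, input_keys),
                output=_pick(item, output_keys),
                metadata={k: item[k] for k in ("id", "status") if k in item},
            )
        )
    return llm_span.children


# Usage with a hand-built (not recorded) response output list.
llm = RecordedSpan(name="openai.responses.create", type="llm")
add_tool_child_spans(
    llm,
    [
        {"type": "code_interpreter_call", "id": "ci_1", "status": "completed",
         "code": "print(1)", "outputs": [{"type": "logs", "logs": "1\n"}]},
        {"type": "message", "content": []},
    ],
)
```

In the real wrapper the child would presumably be created through the tracing library with SpanTypeAttribute.TOOL, as the Google GenAI integration does; the table-driven mapping is just one way to keep per-type handling in a single place.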

Comparison with other integrations in this repo

The Google GenAI integration (py/src/braintrust/integrations/google_genai/tracing.py) creates dedicated SpanTypeAttribute.TOOL spans via _log_posthoc_interaction_tool_span and _activate_interaction_tool_span for:

  • function_call / function_result
  • code_execution_call / code_execution_result
  • file_search_call / file_search_result
  • url_context_call / url_context_result
  • mcp_server_tool_call / mcp_server_tool_result

The Claude Agent SDK, Pydantic AI, Agno, ADK, AgentScope, and OpenAI Agents SDK integrations also create dedicated tool spans.

Test coverage

There are zero tests for any built-in tool type in the Responses API test file (py/src/braintrust/integrations/openai/test_openai.py). No cassettes exist for web search, file search, code interpreter, computer use, image generation, or MCP tool responses.

Braintrust docs status

not_found — The OpenAI integration page does not mention Responses API built-in tools, tool span decomposition, or server-side tool instrumentation.

Upstream sources

  • OpenAI Python SDK response output item types: openai/types/responses/ — defines response_web_search_call_*, response_file_search_call_*, response_code_interpreter_call_*, response_computer_tool_call*, response_image_gen_call_*, response_mcp_call_* event and item types (220+ type files).
  • OpenAI built-in tools documentation: https://platform.openai.com/docs/guides/tools

Local files inspected

  • py/src/braintrust/integrations/openai/tracing.py:
    • ResponseWrapper._parse_event_from_result() (line 763) — stores output as flat blob, no type inspection
    • ResponseWrapper._postprocess_streaming_results() (line 783) — accumulates output items generically, no type-specific handling for tool items
  • py/src/braintrust/integrations/openai/patchers.py (lines 282-347) — patches create() and parse() only
  • py/src/braintrust/integrations/openai/test_openai.py — no tests for any built-in tool type
  • py/src/braintrust/integrations/google_genai/tracing.py (lines 771-838) — creates SpanTypeAttribute.TOOL child spans for equivalent tool outputs (for comparison)
