Summary
OpenAI's GPT-4o models support audio input/output in chat completions. When streaming with audio output (modalities: ["text", "audio"]), the API sends delta.audio chunks containing audio.id, audio.transcript, audio.data (base64), and audio.expires_at. The current aggregateChatCompletionChunks function only handles delta.role, delta.content, delta.tool_calls, and delta.finish_reason — it does not aggregate delta.audio at all.
This means the audio transcript (the most useful field for observability) is lost in streaming responses. Non-streaming responses capture the full response object, so audio data is preserved there. Audio token metrics (prompt_audio_tokens, completion_audio_tokens) are already correctly extracted from usage data.
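For reference, a single streaming chunk carrying audio output has roughly the following shape (a sketch: all values are made up; only the field names come from the OpenAI audio guide referenced below):

```ts
// Illustrative shape of one streaming chunk with audio output.
// Values are hypothetical; field names follow the OpenAI docs.
const chunk = {
  id: "chatcmpl-abc123",
  object: "chat.completion.chunk",
  choices: [
    {
      index: 0,
      delta: {
        audio: {
          id: "audio_abc123",       // stable id for the audio response
          transcript: "Hello",      // incremental transcript text
          data: "UklGRi4AAABXQVZF", // base64-encoded audio bytes
          expires_at: 1730000000,   // unix timestamp
        },
      },
      finish_reason: null,
    },
  ],
};
```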
What is missing
- Streaming aggregation (js/src/instrumentation/plugins/openai-plugin.ts, aggregateChatCompletionChunks, ~lines 389-466): handles delta.role, delta.content, delta.tool_calls, and delta.finish_reason, but has no branch for delta.audio; audio transcript chunks are silently dropped.
- Vendor SDK types (js/src/vendor-sdk-types/openai-common.ts): the OpenAIChatDelta interface defines role, content, tool_calls, finish_reason, and a catch-all [key: string]: unknown, with no explicit audio field (see the type sketch after this list).
- E2E tests: no scenario exercises the audio modality in chat completions.
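A minimal sketch of the missing type, assuming the OpenAIChatDelta shape described above (the audio field names come from the upstream docs; the interface layout here is an assumption, not the actual declaration in openai-common.ts):

```ts
// Sketch: an explicit audio field for OpenAIChatDelta. The existing
// fields are as described in this report; the real declaration in
// js/src/vendor-sdk-types/openai-common.ts may differ in detail.
interface OpenAIChatDeltaAudio {
  id?: string;          // audio response id
  transcript?: string;  // incremental transcript text
  data?: string;        // base64-encoded audio chunk
  expires_at?: number;  // unix timestamp
}

interface OpenAIChatDelta {
  role?: string;
  content?: string | null;
  tool_calls?: unknown[];
  finish_reason?: string | null;
  audio?: OpenAIChatDeltaAudio; // new: audio output deltas
  [key: string]: unknown;
}
```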
Expected behavior
The streaming aggregation should at minimum concatenate delta.audio.transcript chunks so the final span output includes the model's audio transcript. Storing the full base64 audio.data in spans would be impractical (very large), but the transcript text is compact and essential for observability.
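A minimal sketch of that transcript-only aggregation, written as a hypothetical helper rather than the actual plugin code (the accumulator shape is an assumption):

```ts
// Hypothetical helper showing transcript-only aggregation of
// delta.audio chunks; not the actual aggregateChatCompletionChunks code.
interface AggregatedAudio {
  id?: string;
  transcript: string;
}

function aggregateAudioDelta(
  acc: AggregatedAudio | undefined,
  audio: { id?: string; transcript?: string; data?: string } | undefined,
): AggregatedAudio | undefined {
  if (!audio) return acc;
  const next = acc ?? { transcript: "" };
  if (audio.id) next.id = audio.id; // keep the latest audio id
  if (typeof audio.transcript === "string") {
    next.transcript += audio.transcript; // concatenate text chunks
  }
  // audio.data (base64) is intentionally dropped: too large for spans.
  return next;
}
```

Calling something like this once per chunk inside the existing aggregation loop would leave the full transcript on the final span output.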
Upstream reference
- OpenAI audio output in chat completions: https://platform.openai.com/docs/guides/audio
- Streaming audio delta fields: audio.id, audio.data (base64 chunks), audio.transcript (text chunks), and audio.expires_at.
- Supported on the gpt-4o-audio-preview and gpt-4o-mini-audio-preview models.
- Audio token metrics are documented in the usage object as prompt_tokens_details.audio_tokens and completion_tokens_details.audio_tokens.
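For context, a streaming request that exercises this path looks roughly like the following sketch using the official openai Node SDK (the audio option values, e.g. voice and format, are illustrative and should be checked against the guide above):

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Streaming chat completion with audio output. The option values are
// illustrative; consult the OpenAI audio guide for currently
// supported voices and formats.
const stream = await openai.chat.completions.create({
  model: "gpt-4o-audio-preview",
  modalities: ["text", "audio"],
  audio: { voice: "alloy", format: "pcm16" },
  messages: [{ role: "user", content: "Say hello." }],
  stream: true,
});

for await (const chunk of stream) {
  // The SDK's delta type may not declare `audio`, hence the cast.
  const audio = (chunk.choices[0]?.delta as { audio?: { transcript?: string } })
    ?.audio;
  if (audio?.transcript) process.stdout.write(audio.transcript);
}
```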
Braintrust docs status
The Braintrust OpenAI integration page documents chat completions but does not mention the audio modality.
What already works
Audio token metrics are properly extracted. In js/src/openai-utils.ts, the extractOpenAIMetrics function maps input_tokens_details.audio_tokens → prompt_audio_tokens and output_tokens_details.audio_tokens → completion_audio_tokens. This is confirmed by unit tests in openai-plugin.test.ts.
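For comparison, the audio-token portion of that mapping behaves roughly like this paraphrased sketch (the real extractOpenAIMetrics covers more fields, and the fallback across the two usage naming schemes is an assumption, not confirmed against openai-utils.ts):

```ts
// Sketch of the audio-token mapping described above. The fallback
// between input_tokens_details (Responses-style naming) and
// prompt_tokens_details (chat-completions-style naming) is an
// assumption for illustration only.
function extractAudioTokenMetrics(usage: Record<string, any>) {
  return {
    prompt_audio_tokens:
      usage.input_tokens_details?.audio_tokens ??
      usage.prompt_tokens_details?.audio_tokens,
    completion_audio_tokens:
      usage.output_tokens_details?.audio_tokens ??
      usage.completion_tokens_details?.audio_tokens,
  };
}
```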
Local files inspected
- js/src/instrumentation/plugins/openai-plugin.ts — aggregateChatCompletionChunks function
- js/src/vendor-sdk-types/openai-common.ts — OpenAIChatDelta interface
- js/src/openai-utils.ts — extractOpenAIMetrics (audio token metrics)
- js/src/wrappers/oai.ts — wrapper proxy
- e2e/scenarios/openai-instrumentation/scenario.impl.mjs — e2e test scenarios