
FEAT plumb media output through adversarial feedback loop in RedTeamingAttack#1377

Open
fitzpr wants to merge 6 commits into Azure:main from fitzpr:feature/media-feedback-loop-v2

Conversation

@fitzpr
Contributor

@fitzpr fitzpr commented Feb 19, 2026

Description

When the red teaming attack loop targets an image/video generator (e.g. DALL-E), the adversarial chat (e.g. GPT-4o) never sees the generated media; only the scorer's text feedback is sent back, so the adversarial LLM refines its prompts blindly.

This PR makes media outputs flow through the feedback loop as multimodal messages (text + image/video), so the adversarial chat can see what the target actually produced and give better-informed follow-up prompts.

Before: Target → image → Scorer → "missing a hat" (text only) → Adversarial LLM
After: Target → image → Scorer → "missing a hat" + image → Adversarial LLM

Project #6a. @romanlutz for visibility.

Changes

red_teaming.py - core change (3 methods):

  • _handle_adversarial_file_response now returns a (feedback_text, media_piece) tuple instead of just a string
  • _build_adversarial_prompt returns Union[str, tuple] - string for text responses, tuple for media
  • _generate_next_prompt_async builds a multimodal Message with text + media pieces when a tuple is returned
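
The three-method change above can be sketched as follows. `Message` and `MessagePiece` here are simplified stand-ins for PyRIT's message types, and the function bodies are assumptions based on this description, not the actual `red_teaming.py` code:

```python
# Sketch only: simplified stand-ins for PyRIT's Message/MessagePiece types.
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

@dataclass
class MessagePiece:
    data_type: str  # e.g. "text", "image_path", "video_path"
    value: str

@dataclass
class Message:
    pieces: List[MessagePiece]

def build_adversarial_prompt(
    feedback: str, media_path: Optional[str] = None
) -> Union[str, Tuple[str, MessagePiece]]:
    # Text-only responses keep the old str return; media responses
    # propagate a (feedback_text, media_piece) tuple instead.
    if media_path is None:
        return feedback
    return feedback, MessagePiece(data_type="image_path", value=media_path)

def generate_next_prompt(prompt: Union[str, Tuple[str, MessagePiece]]) -> Message:
    # A tuple means the adversarial chat should see both the scorer's
    # feedback and the generated media; a str stays a text-only message.
    if isinstance(prompt, tuple):
        feedback, media_piece = prompt
        return Message(pieces=[MessagePiece("text", feedback), media_piece])
    return Message(pieces=[MessagePiece("text", prompt)])
```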

openai_chat_target.py - 1-line bug fix:

  • Content-filtered responses have data type `error`, which crashed multimodal message building with `ValueError: Multimodal data type error is not yet supported`. Error responses are now treated as text, the same as the "text" data type.
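
The shape of the fix, as a minimal sketch (function and constant names are illustrative, not the actual `openai_chat_target.py` internals):

```python
# Sketch only: illustrative content-building helper, not PyRIT's real code.
# "error" previously fell through to the ValueError branch below.
TEXT_LIKE_TYPES = {"text", "error"}

def to_openai_content_item(data_type: str, value: str) -> dict:
    if data_type in TEXT_LIKE_TYPES:
        # Content-filter errors are rendered as plain text so that prior
        # turns containing them no longer crash message construction.
        return {"type": "text", "text": value}
    if data_type == "image_path":
        return {"type": "image_url", "image_url": {"url": value}}
    raise ValueError(f"Multimodal data type {data_type} is not yet supported.")
```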

No new files, classes, or dependencies. Fully backward compatible: text-only workflows are unchanged.

Tests and Documentation

test_red_teaming.py - 5 new tests, 2 updated:

  • test_generate_next_prompt_sends_multimodal_message_for_image_response - verifies 2-piece multimodal message construction (text + image)
  • test_generate_next_prompt_sends_multimodal_message_for_video_response - same for video
  • test_generate_next_prompt_text_response_stays_text_only - regression test ensuring text path is unaffected
  • test_build_adversarial_prompt_returns_tuple_for_image_response - return type assertion
  • test_build_adversarial_prompt_returns_str_for_text_response - return type assertion
  • 2 existing tests updated to assert tuple returns from _handle_adversarial_file_response

All 161 tests pass (70 red_teaming + 91 chat_target). JupyText not run; this PR contains no notebook changes.

Robert Fitzpatrick and others added 4 commits February 18, 2026 18:42
When the objective target returns non-text content (images, video, etc.),
the adversarial chat now receives a multimodal message containing both
the scorer's textual feedback AND the actual generated media. This enables
vision-capable adversarial LLMs (e.g. GPT-4o) to see what the target
produced and craft more informed follow-up prompts.

Changes:
- _handle_adversarial_file_response: returns (feedback_text, media_piece)
  tuple instead of just the feedback string
- _build_adversarial_prompt: returns Union[str, tuple] to propagate media
- _generate_next_prompt_async: constructs multimodal Message with text +
  media pieces when file response detected; text-only path unchanged

Tests:
- Updated 2 existing tests for new tuple return type
- Added 5 new tests in TestMultimodalFeedbackLoop:
  - image response produces multimodal message to adversarial chat
  - video response produces multimodal message to adversarial chat
  - text response stays text-only (no regression)
  - _build_adversarial_prompt returns tuple for image
  - _build_adversarial_prompt returns str for text
When a target response has data_type='error' (e.g. content filter block),
treat it as text in OpenAIChatTarget's multimodal message builder instead
of raising ValueError. This prevents crashes when conversation history
contains error responses from prior turns.
Contributor

@romanlutz romanlutz left a comment


This is a great contribution! Exactly what I was thinking of. We are missing something fundamental in PyRIT for this to work, though.

@romanlutz romanlutz changed the title FEAT plumb media output through adversarial feedback loop (#6a) FEAT plumb media output through adversarial feedback loop in RedTeamingAttack Feb 19, 2026
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
fitzpr pushed a commit to fitzpr/PyRIT that referenced this pull request Feb 19, 2026
- Add SUPPORTED_INPUT_MODALITIES class attribute to PromptTarget base class
- Add input_modality_supported() and supports_multimodal_input() methods
- Add supported_input_modalities property that returns list of supported modalities
- Add supported_input_modalities and supports_conversation_history fields to TargetIdentifier
- Update PromptTarget._create_identifier() to populate new fields
- Implement modality declarations in OpenAIChatTarget (text, image_path), TextTarget (text), and HuggingFaceChatTarget (text)
- Add comprehensive tests for modality support detection

This system enables attacks to detect whether targets support multimodal input (text + other modalities) and route accordingly, addressing the limitation mentioned in PR Azure#1377 where multimodal attacks need to know target capabilities.
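
The class and method names below follow the commit message, but the bodies are a sketch of how such a declaration system could look, not PyRIT's actual implementation:

```python
# Sketch only: modality declarations on a simplified PromptTarget hierarchy.
class PromptTarget:
    # Subclasses override this to declare what inputs they accept.
    SUPPORTED_INPUT_MODALITIES: tuple = ("text",)

    def input_modality_supported(self, modality: str) -> bool:
        return modality in self.SUPPORTED_INPUT_MODALITIES

    def supports_multimodal_input(self) -> bool:
        return len(self.SUPPORTED_INPUT_MODALITIES) > 1

class OpenAIChatTarget(PromptTarget):
    SUPPORTED_INPUT_MODALITIES = ("text", "image_path")

class TextTarget(PromptTarget):
    SUPPORTED_INPUT_MODALITIES = ("text",)
```

An attack can then query a target before routing media to it, e.g. `target.input_modality_supported("image_path")`.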
Address Roman's feedback items #2 and #3:
- Change _build_adversarial_prompt to return Message instead of Union type
- Extract message construction logic into separate helper methods
- Add _build_text_message() for simple text prompts
- Add _build_multimodal_message() for media responses
- Simplify caller code by removing tuple handling logic
- Improve logging to work with Message objects

These architectural improvements prepare the code to integrate with
the modality support detection system from separate PR.
fitzpr pushed a commit to fitzpr/PyRIT that referenced this pull request Feb 20, 2026
Addresses all of Roman's feedback from PR Azure#1377:
- Uses set[frozenset[PromptDataType]] instead of tuples
- Exact frozenset matching prevents ordering issues
- Implemented across all target types (OpenAI, HuggingFace, TextTarget)
- Future-proof pattern matching for new OpenAI models
- Optional verification utility for runtime testing
- Comprehensive test suite with 8 passing tests
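
The `set[frozenset[PromptDataType]]` idea can be illustrated in a few lines; the names are taken from the commit message but the code is an assumed sketch, with plain strings standing in for `PromptDataType`:

```python
# Sketch only: each frozenset is one allowed modality combination, so
# membership checks are exact and insensitive to ordering.
SUPPORTED_INPUT_MODALITIES: set = {
    frozenset({"text"}),
    frozenset({"text", "image_path"}),
}

def combination_supported(modalities: set) -> bool:
    # frozenset({"image_path", "text"}) == frozenset({"text", "image_path"}),
    # which is what prevents the ordering issues tuples would have.
    return frozenset(modalities) in SUPPORTED_INPUT_MODALITIES
```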