Crafting a product display webpage from a source product image, together with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and e-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction following align closely with the core capabilities of advanced multimodal generative models, such as image editing models and unified models (UMs). To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. ProductWebGen comprises 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage that includes multiple images consistent with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two evaluation workflows: one uses large language models (LLMs) and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based approaches may hold an advantage in fulfilling visual content instructions. We also construct a supervised fine-tuning (SFT) dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code, and verify its effectiveness on the open-source UM BAGEL.
We evaluate two primary workflows for multimodal webpage generation: the Editing-based approach and the UM-based approach. The key difference lies in how the multiple images are generated.
- Editing-based: An LLM first generates the complete HTML code along with textual descriptions for the images (often in `alt` tags). These descriptions are then fed, together with the source image, into a specialized image editing model to produce the final images.
- UM-based: A Unified Model (UM) generates the images by conditioning on a multimodal context, which can include the source image, previously generated images, textual descriptions, and user instructions. This allows the model to maintain better consistency across images.
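In the editing-based workflow, the per-image prompts can be recovered directly from the generated HTML. A minimal sketch, assuming the descriptions live in `alt` attributes as noted above (the HTML fragment below is hypothetical):

```python
import re

def extract_alt_descriptions(html: str) -> list:
    """Collect alt-text descriptions from <img> tags; the editing-based
    workflow can feed each one, with the source image, to the editor."""
    return re.findall(r'<img[^>]*\balt="([^"]*)"', html)

# Hypothetical fragment of LLM-generated HTML
html = (
    '<img src="1_edit_1_qwen.jpg" alt="Product on a marble countertop">\n'
    '<img src="1_edit_2_qwen.jpg" alt="Close-up of the product label">\n'
)
print(extract_alt_descriptions(html))
# → ['Product on a marble countertop', 'Close-up of the product label']
```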
- Set up environment
git clone https://github.com/SJTU-DENG-Lab/ProductWebGen.git
cd ProductWebGen
conda create -n ProductWebGen python=3.10 -y
conda activate ProductWebGen
pip install -r requirements.txt
pip install ninja
pip install flash-attn==2.8.3 --no-build-isolation

The ProductWebGen environment is used for evaluation and inference with editing-based methods.
- Model-Specific Setup
- API-based Models (Gemini, etc.): All API-based models are accessed via the OpenRouter API. Please obtain your API key from OpenRouter and pass it via the `--api_key` argument in the run commands.
- Open-Source UMs (BAGEL, Ovis-U1, OmniGen2): For inference with BAGEL, Ovis-U1, and OmniGen2, please refer to their respective official projects for detailed instructions on setting up the environment and downloading pre-trained model weights.
The following commands detail how to run inference for all baseline models and how to evaluate the results.
Benchmark download: ProductWebGen on Hugging Face
Fine-tuning dataset download: ProductWebGen-1K on Hugging Face
Note: Please replace ProductWebGen_benchmark.json with the correct benchmark JSON file path, "your_model_path" with the actual path to your downloaded model weights, and "xxxx" with your OpenRouter API key.
Step 1: Generate HTML (using an LLM)
python inference/editing-based/generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_name "x-ai/grok-4" --start 0 --end 1 --api_key "xxxx" --output_path "editing-result"

Step 2: Generate Images (using an Image Editor)
python inference/editing-based/edit_source_image_qwen.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "editing-result"
python inference/editing-based/edit_source_image_flux.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "editing-result"

# Gemini-2.5-Flash-Image
python inference/um-based/Gemini-2.5-flash-image/nano_banana_generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --start 0 --end 1 --api_key "xxxx" --output_path "um-result"
python inference/um-based/Gemini-2.5-flash-image/nano_banana_generate_image_without_html.py --benchmark_path "ProductWebGen_benchmark.json" --start 0 --end 1 --api_key "xxxx" --output_path "um-result"
python inference/um-based/Gemini-2.5-flash-image/nano_banana_generate_image_with_html.py --benchmark_path "ProductWebGen_benchmark.json" --start 0 --end 1 --api_key "xxxx" --output_path "um-result"
# BAGEL
python inference/um-based/Bagel/bagel_generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Bagel/models/BAGEL-7B-MoT" --start 0 --end 1 --output_path "bagel-result"
python inference/um-based/Bagel/bagel_generate_image_without_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Bagel/models/BAGEL-7B-MoT" --start 0 --end 1 --output_path "bagel-result"
python inference/um-based/Bagel/bagel_generate_image_with_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Bagel/models/BAGEL-7B-MoT" --start 0 --end 1 --output_path "bagel-result"
# Ovis-U1
python inference/um-based/Ovis-U1/ovis_generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Ovis-U1/Ovis-U1-3B" --start 0 --end 1 --output_path "ovis-result"
python inference/um-based/Ovis-U1/ovis_generate_image_without_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Ovis-U1/Ovis-U1-3B" --start 0 --end 1 --output_path "ovis-result"
python inference/um-based/Ovis-U1/ovis_generate_image_with_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Ovis-U1/Ovis-U1-3B" --start 0 --end 1 --output_path "ovis-result"
# OmniGen2
python inference/um-based/OmniGen2/omnigen2_generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "omnigen2-result"
python inference/um-based/OmniGen2/omnigen2_generate_image_without_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "omnigen2-result"
python inference/um-based/OmniGen2/omnigen2_generate_image_with_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "omnigen2-result"

After running inference, your output directory should be structured as follows:
example-result/
└── 1
├── 1.html
├── 1_edit_1_qwen.jpg
├── 1_edit_2_qwen.jpg
├── 1_edit_3_qwen.jpg
└── 1_edit_4_qwen.jpg

Render the HTML files and take screenshots for visual evaluation, then evaluate the results.
python evaluate/screenshot.py --benchmark_path "ProductWebGen_benchmark.json" --inference_result_path "example-result" --start 0 --end 1
python evaluate/metric.py --benchmark_path "ProductWebGen_benchmark.json" --inference_result_path "example-result" --start 0 --end 1 --api_key "xxxx" --output_path "evaluate-result"

We would like to sincerely thank the developers of the open-source models BAGEL, Ovis-U1, OmniGen2, Qwen-Image-Edit, and FLUX.1 Kontext; our work is heavily built upon these resources.
