Crafting a product display webpage from a source product image, together with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and e-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction following align closely with the core capabilities of advanced multimodal generative models, such as image editing models and unified models (UMs). To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. ProductWebGen comprises 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage that includes multiple images consistent with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two evaluation workflows: one uses large language models (LLMs) and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based approaches may hold an advantage in fulfilling visual content instructions. We also construct a supervised fine-tuning (SFT) dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code, and verify its effectiveness on the open-source UM BAGEL.
We evaluate two primary workflows for multimodal webpage generation: the Editing-based approach and the UM-based approach. The key difference lies in how the multiple images are generated.
- Editing-based: An LLM first generates the complete HTML code along with textual descriptions for the images (often in `alt` tags). These descriptions are then fed, together with the source image, into a specialized image editing model to produce the final images.
- UM-based: A Unified Model (UM) generates the images by conditioning on a multimodal context, which can include the source image, previously generated images, textual descriptions, and user instructions. This allows the model to maintain better consistency across images.
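In the editing-based workflow, the per-image prompts can be recovered directly from the generated HTML. A minimal sketch, assuming the descriptions live in `alt` attributes as noted above (the HTML fragment below is hypothetical):

```python
import re

def extract_alt_descriptions(html: str) -> list:
    """Collect alt-text descriptions from <img> tags; the editing-based
    workflow can feed each one, with the source image, to the editor."""
    return re.findall(r'<img[^>]*\balt="([^"]*)"', html)

# Hypothetical fragment of LLM-generated HTML
html = (
    '<img src="1_edit_1_qwen.jpg" alt="Product on a marble countertop">\n'
    '<img src="1_edit_2_qwen.jpg" alt="Close-up of the product label">\n'
)
print(extract_alt_descriptions(html))
# → ['Product on a marble countertop', 'Close-up of the product label']
```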
- Set up environment
git clone https://github.com/SJTU-DENG-Lab/ProductWebGen.git
cd ProductWebGen
conda create -n ProductWebGen python=3.10 -y
conda activate ProductWebGen
pip install -r requirements.txt
pip install ninja
pip install flash-attn==2.8.3 --no-build-isolation

The ProductWebGen environment is used for evaluation and inference with editing-based methods.
- Model-Specific Setup
- API-based Models (Gemini, etc.): All API-based models are accessed via the OpenRouter API. Please obtain your API key from OpenRouter and pass it via the `--api_key` argument in the run commands.
- Open-Source UMs (BAGEL, Ovis-U1, OmniGen2): For inference with BAGEL, Ovis-U1, and OmniGen2, please refer to their respective official projects for detailed instructions on setting up the environment and downloading pre-trained model weights.
The following commands detail how to run inference for all baseline models and how to evaluate the results.
Benchmark download: ProductWebGen on Hugging Face
Fine-tuning dataset download: ProductWebGen-1K on Hugging Face
Note: Please replace ProductWebGen_benchmark.json with the correct benchmark JSON file path, "your_model_path" with the actual path to your downloaded model weights, and "xxxx" with your OpenRouter API key.
Step 1: Generate HTML (using an LLM)
python inference/editing-based/generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_name "x-ai/grok-4" --start 0 --end 1 --api_key "xxxx" --output_path "editing-result"

Step 2: Generate Images (using an Image Editor)
python inference/editing-based/edit_source_image_qwen.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "editing-result"
python inference/editing-based/edit_source_image_flux.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "editing-result"

# Gemini-2.5-Flash-Image
python inference/um-based/Gemini-2.5-flash-image/nano_banana_generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --start 0 --end 1 --api_key "xxxx" --output_path "um-result"
python inference/um-based/Gemini-2.5-flash-image/nano_banana_generate_image_without_html.py --benchmark_path "ProductWebGen_benchmark.json" --start 0 --end 1 --api_key "xxxx" --output_path "um-result"
python inference/um-based/Gemini-2.5-flash-image/nano_banana_generate_image_with_html.py --benchmark_path "ProductWebGen_benchmark.json" --start 0 --end 1 --api_key "xxxx" --output_path "um-result"
# BAGEL
python inference/um-based/Bagel/bagel_generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Bagel/models/BAGEL-7B-MoT" --start 0 --end 1 --output_path "bagel-result"
python inference/um-based/Bagel/bagel_generate_image_without_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Bagel/models/BAGEL-7B-MoT" --start 0 --end 1 --output_path "bagel-result"
python inference/um-based/Bagel/bagel_generate_image_with_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Bagel/models/BAGEL-7B-MoT" --start 0 --end 1 --output_path "bagel-result"
# Ovis-U1
python inference/um-based/Ovis-U1/ovis_generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Ovis-U1/Ovis-U1-3B" --start 0 --end 1 --output_path "ovis-result"
python inference/um-based/Ovis-U1/ovis_generate_image_without_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Ovis-U1/Ovis-U1-3B" --start 0 --end 1 --output_path "ovis-result"
python inference/um-based/Ovis-U1/ovis_generate_image_with_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path "inference/um-based/Ovis-U1/Ovis-U1-3B" --start 0 --end 1 --output_path "ovis-result"
# OmniGen2
python inference/um-based/OmniGen2/omnigen2_generate_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "omnigen2-result"
python inference/um-based/OmniGen2/omnigen2_generate_image_without_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "omnigen2-result"
python inference/um-based/OmniGen2/omnigen2_generate_image_with_html.py --benchmark_path "ProductWebGen_benchmark.json" --model_path your_model_path --start 0 --end 1 --output_path "omnigen2-result"

After running inference, your output directory should be structured as follows:
example-result/
└── 1
├── 1.html
├── 1_edit_1_qwen.jpg
├── 1_edit_2_qwen.jpg
├── 1_edit_3_qwen.jpg
└── 1_edit_4_qwen.jpg

Render the HTML files and take screenshots for visual evaluation, then evaluate the results.
python evaluate/screenshot.py --benchmark_path "ProductWebGen_benchmark.json" --inference_result_path "example-result" --start 0 --end 1
python evaluate/metric.py --benchmark_path "ProductWebGen_benchmark.json" --inference_result_path "example-result" --start 0 --end 1 --api_key "xxxx" --output_path "evaluate-result"

We would like to sincerely thank the developers of the open-source models BAGEL, Ovis-U1, OmniGen2, Qwen-Image-Edit, and FLUX.1 Kontext; our work is heavily built upon these resources.
