Dynamic Context Evolution for Scalable Synthetic Data Generation

Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (VTS), which filters out high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. DCE achieves 0.0% collapse versus 5.6% for naive prompting across three domains and two model families, at ~$0.50 per 1,000 candidates using only standard API calls.

Architecture

Each batch runs four steps: the generator produces candidates → verbalized tail sampling (VTS) filters out obvious ideas (self-assessed probability > 0.10) → semantic memory rejects near-duplicates (cosine similarity > 0.85 against previously accepted items) → prompt evolution rewrites the next prompt using memory state and a rotating diversity strategy.
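The per-batch flow above can be sketched in a few lines of Python. This is a minimal illustration, not the code in src/: only the two thresholds (0.10 and 0.85) come from the description above. The class and function names, the strategy names, the prompt wording, and the toy two-dimensional embeddings are all assumptions made for the example, and a real run would obtain candidates, self-assessed probabilities, and embeddings from API calls.

```python
import math
from dataclasses import dataclass, field

VTS_MAX_PROBABILITY = 0.10   # threshold from the pipeline description
DUPLICATE_SIMILARITY = 0.85  # threshold from the pipeline description

# Hypothetical strategy names, for illustration only.
STRATEGIES = ["gap targeting", "assumption inversion", "cross-domain analogy"]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticMemory:
    """Persistent store of accepted texts and their embeddings."""
    texts: list = field(default_factory=list)
    vectors: list = field(default_factory=list)

    def is_near_duplicate(self, vector):
        return any(cosine(vector, v) > DUPLICATE_SIMILARITY for v in self.vectors)

    def add(self, text, vector):
        self.texts.append(text)
        self.vectors.append(vector)


def filter_batch(candidates, memory):
    """Apply the VTS filter, then the memory filter.

    candidates: iterable of (text, self_assessed_probability, embedding).
    Accepted items are added to memory, so later batches see them.
    """
    accepted = []
    for text, probability, vector in candidates:
        if probability > VTS_MAX_PROBABILITY:
            continue  # too obvious according to the model's own estimate
        if memory.is_near_duplicate(vector):
            continue  # too close to something already accepted
        memory.add(text, vector)
        accepted.append(text)
    return accepted


def evolve_prompt(base_prompt, memory, batch_index, recent=5):
    """Rebuild the prompt from memory state and a rotating strategy."""
    strategy = STRATEGIES[batch_index % len(STRATEGIES)]
    covered = "; ".join(memory.texts[-recent:]) or "none yet"
    return f"{base_prompt}\nAlready covered: {covered}\nDiversity strategy: {strategy}"


# Toy batch with hand-written probabilities and 2-D embeddings.
memory = SemanticMemory()
batch = [
    ("idea A", 0.05, [1.0, 0.0]),
    ("idea B", 0.40, [0.0, 1.0]),    # rejected by VTS (0.40 > 0.10)
    ("idea C", 0.02, [0.99, 0.05]),  # rejected as a near-duplicate of idea A
    ("idea D", 0.03, [0.0, 1.0]),
]
kept = filter_batch(batch, memory)
next_prompt = evolve_prompt("Generate new ideas.", memory, batch_index=1)
```

Because the memory object outlives the batch, calling filter_batch again with a later batch rejects anything close to items accepted earlier, which is the cross-batch behavior the architecture relies on. The linear scan over stored vectors keeps the sketch short; a larger index would normally use a vector search library instead.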

Installation

git clone https://github.com/ryanlingo/dynamic-context-evolution.git
cd dynamic-context-evolution
pip install -e .

# Or with uv:
uv sync

For downstream evaluation (DeBERTa classifier):

pip install -e ".[downstream]"

Quick Start

Copy the environment template and add your API keys:

cp .env.example .env
# Edit .env with your OPENAI_API_KEY and ANTHROPIC_API_KEY

Run a DCE generation session:

python experiments/run_exp2_comparison.py

Configuration lives in config.yaml; adjust the domain, batch count, thresholds, and other settings there.

Reproducing Paper Experiments

Experiment 1 — Cross-batch mode collapse:

python experiments/run_exp1_collapse.py

Experiment 2 — DCE vs. baselines (multi-seed):

python experiments/run_multi_seed.py

Sensitivity analysis:

python experiments/run_sensitivity.py
python experiments/run_sensitivity_thresholds.py

Downstream evaluation:

python experiments/run_downstream.py

Analysis scripts in analysis/ generate all paper figures and tables.

Data

Experiment data (raw generation logs and processed embeddings) is available on the GitHub Releases page.

Citation

@article{lingo2026dynamic,
  title={Dynamic Context Evolution for Scalable Synthetic Data Generation},
  author={Lingo, Ryan and Chhajer, Rajeev},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

License

This project is licensed under the Apache License 2.0 — see LICENSE for details.
