
Add Evo2-7B contrib model (StripedHyena 2 DNA language model)#123

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/evo2

Conversation

@jimburtoft
Contributor

Summary

  • Adds Arc Institute's Evo 2 (7B, StripedHyena 2 architecture) DNA language model to contrib
  • Hybrid SSM+Attention architecture with custom NeuronFFT (Stockham radix-2) replacing unsupported torch.fft.*
  • Vanilla torch_neuronx.trace() compilation with block-by-block prefill (32 NEFFs) and batched decode with HBM-resident state
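The block-by-block compilation scheme can be sketched as follows. This is an illustrative outline, not the PR's code: the `Linear` modules stand in for the StripedHyena 2 blocks, and `compile_block` is a hypothetical helper whose commented body shows where `torch_neuronx.trace()` would run on a Trainium instance.

```python
import torch

# Stand-ins for the 32 StripedHyena 2 blocks; the real modules also carry
# SSM/FIR or attention state alongside the hidden activations.
blocks = [torch.nn.Linear(16, 16) for _ in range(32)]

def compile_block(block, example_input):
    # On a Trainium instance each block would be traced to its own NEFF:
    #   import torch_neuronx
    #   return torch_neuronx.trace(block, example_input)
    return block  # CPU fallback so the sketch runs anywhere

example = torch.zeros(1, 16)
compiled = [compile_block(b, example) for b in blocks]  # 32 compiled artifacts

def prefill(hidden):
    # Run the blocks sequentially, threading the hidden state through.
    for block in compiled:
        hidden = block(hidden)
    return hidden
```

Tracing each block separately keeps every compilation unit small, which is why the prefill path ends up as 32 NEFFs rather than one monolithic graph.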

Model Details

  • Source: arcinstitute/evo2 (7B params, Apache 2.0)
  • Architecture: 32-layer StripedHyena 2 (9 HCS + 9 HCM + 9 HCL/SSM + 5 ATT blocks)
  • Tokenizer: Single-nucleotide (A, C, G, T), vocab=512
  • Instance: trn2.3xlarge (LNC=2)

Performance

| Config | Throughput | Notes |
| --- | --- | --- |
| Prefill (seq=2048) | 413 tok/s | Cosine sim 0.99968 vs CPU |
| Decode BS=32, 1 core | 15.6 tok/s | 100% token match |
| Decode BS=32, DP=4 | 63.0 tok/s | Near-perfect 4.04x scaling |

Key Technical Contributions

  • NeuronFFT: Pure-PyTorch Stockham FFT replacing torch.fft.rfft/irfft (unsupported on Neuron XLA)
  • HBM state management: KV-cache + SSM/FIR state persisted via input_output_aliases across decode steps
  • DP=4 process-per-core: Each NeuronCore runs independent inference for 4x throughput
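To make the FFT idea concrete, here is a minimal pure-tensor Stockham radix-2 FFT in the same spirit. It is an illustrative reimplementation, not the PR's NeuronFFT code: it uses only reshapes, slicing, and elementwise math, all of which lower through XLA, and avoids `torch.fft` entirely.

```python
import math
import torch

def stockham_fft(re, im):
    """Radix-2 Stockham autosort FFT of a length-N (power of two) signal,
    given as separate real/imaginary tensors -- no torch.fft anywhere."""
    N = re.shape[-1]
    assert N & (N - 1) == 0, "length must be a power of two"
    n, s = N, 1  # current sub-transform length and stride
    while n > 1:
        m = n // 2
        re2, im2 = re.reshape(n, s), im.reshape(n, s)
        ar, ai = re2[:m], im2[:m]
        br, bi = re2[m:], im2[m:]
        ang = -2.0 * math.pi * torch.arange(m, dtype=re.dtype) / n
        wr, wi = torch.cos(ang).unsqueeze(1), torch.sin(ang).unsqueeze(1)
        yr, yi = torch.empty_like(re2), torch.empty_like(im2)
        yr[0::2], yi[0::2] = ar + br, ai + bi   # butterfly sums
        dr, di = ar - br, ai - bi               # butterfly differences
        yr[1::2] = dr * wr - di * wi            # complex twiddle multiply
        yi[1::2] = dr * wi + di * wr
        re, im = yr.reshape(-1), yi.reshape(-1) # auto-sorted: no bit reversal
        n, s = m, 2 * s
    return re, im
```

The autosort property is what makes Stockham a good fit here: every stage is a dense, regularly strided rewrite of the whole buffer, with no data-dependent bit-reversal permutation to express in the traced graph.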

Testing

  • 7 files: README.md, src/__init__.py, src/modeling_evo2.py, test/{__init__,integration/{__init__,test_model},unit/__init__}.py
  • Integration tests: NeuronFFT accuracy, prefill compilation + cosine sim, decode block compilation
  • All tests validated on trn2.3xlarge with SDK 2.28

Port of Arc Institute's Evo2 7B DNA language model to AWS Neuron/Trainium.
Uses component-wise tracing with NeuronFFT v2 (Stockham FFT) for SSM layers,
block-by-block compilation, and batched decode with state persistence.
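The state-persistence pattern can be sketched as below. `DecodeBlock` is a toy stand-in for a decode step, not the PR's module; only the commented-out `torch_neuronx.trace` call touches the real API, and it is shown with the `input_output_aliases` mapping that keeps state device-resident.

```python
import torch

class DecodeBlock(torch.nn.Module):
    """Toy decode step: consumes one token's hidden state plus a rolling
    cache and returns (output, updated cache) -- the two-output shape
    needed for input/output aliasing."""
    def __init__(self, d):
        super().__init__()
        self.proj = torch.nn.Linear(d, d)

    def forward(self, x, cache):
        new_cache = torch.cat([cache[:, 1:], x], dim=1)  # shift state window
        return self.proj(x), new_cache

block = DecodeBlock(8)
x = torch.zeros(1, 1, 8)      # one decode-step activation
cache = torch.zeros(1, 4, 8)  # state to keep resident in HBM across steps
# On Trainium, aliasing the cache input to output index 1 tells the runtime
# to update the state in place rather than copying it host<->device:
#   import torch_neuronx
#   traced = torch_neuronx.trace(block, (x, cache),
#                                input_output_aliases={cache: 1})
out, new_cache = block(x, cache)
```

Each decode step then feeds the aliased cache output straight back in as the next step's cache input, so the state never leaves device memory.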

Key features:
- NeuronFFT v2: Stockham auto-sort FFT replacing torch.fft (unsupported on XLA)
- 32-block prefill pipeline (seq_len up to 2048, cosine sim 0.99968 vs CPU)
- Batched decode with KV-cache + SSM state (100/100 token match vs CPU)
- DP=4 via process-per-core: 63.0 tok/s total throughput
- Self-contained single-file module (modeling_evo2.py)
- 7 integration tests (4 CPU-only FFT, 3 Neuron compilation+accuracy)
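The DP=4 process-per-core layout can be sketched with standard multiprocessing. `worker` and `run_dp` are hypothetical names for illustration; `NEURON_RT_VISIBLE_CORES` is the actual Neuron runtime variable for core pinning, while the inference call is a placeholder.

```python
import os
import multiprocessing as mp

def worker(core_id, prompt, out_q):
    # Pin this process to a single NeuronCore before any Neuron runtime
    # initialization; each process then runs fully independent inference.
    os.environ["NEURON_RT_VISIBLE_CORES"] = str(core_id)
    result = f"core{core_id}:{prompt}"  # stand-in for loading NEFFs + decoding
    out_q.put((core_id, result))

def run_dp(prompts):
    """Launch one process per NeuronCore: pure data parallelism, no sharding."""
    ctx = mp.get_context("fork")  # real deployments may prefer "spawn"
    q = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(i, p, q))
             for i, p in enumerate(prompts)]
    for pr in procs:
        pr.start()
    results = dict(q.get() for _ in procs)
    for pr in procs:
        pr.join()
    return [results[i] for i in range(len(prompts))]
```

Because the cores share nothing, throughput scales almost linearly with core count, which matches the reported 4.04x at DP=4.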
