
feat(auth): replace static hotkey/API-key auth with Bittensor validator whitelisting and 50% consensus #4

Open

echobt wants to merge 1 commit into main from feat/bittensor-validator-whitelist-consensus

Conversation


@echobt echobt commented Feb 18, 2026

Summary

Replace the static single-hotkey + API-key authentication system with dynamic validator whitelisting from the Bittensor blockchain (netuid 100) and a 50% consensus mechanism for evaluation triggering. Validators with ≥10,000 TAO stake are automatically whitelisted via bittensor-rs, and evaluations submitted through POST /submit only proceed once a majority of whitelisted validators agree on the same archive payload.
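
As a rough illustration of the consensus gate, the required vote count can be derived from the whitelist size and the 50% threshold. This is only a sketch of the arithmetic; the exact rounding lives inside ConsensusManager and the helper name below is hypothetical.

```rust
// Hypothetical helper illustrating the 50% gate; the real rounding is inside
// ConsensusManager and may differ from this sketch.
fn required_votes(total_validators: usize, threshold: f64) -> usize {
    // ceil(total * threshold), never below 1 so a one-validator whitelist
    // can still reach consensus.
    ((total_validators as f64) * threshold).ceil().max(1.0) as usize
}

fn main() {
    assert_eq!(required_votes(10, 0.5), 5); // 5 of 10 meets the >=50% bar
    assert_eq!(required_votes(7, 0.5), 4);  // majority of an odd count
    assert_eq!(required_votes(1, 0.5), 1);
}
```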

Changes

New modules

  • src/validator_whitelist.rs — ValidatorWhitelist backed by parking_lot::RwLock<HashSet<String>>, with a background refresh loop (every 5 min) that connects to subtensor via BittensorClient::with_failover(), syncs the metagraph for netuid 100, and filters validators by validator_permit && active && stake >= 10k TAO. Includes retry with exponential backoff (3 attempts) and graceful fallback to the cached whitelist on connection failure.
  • src/consensus.rs — ConsensusManager using DashMap<String, PendingConsensus> keyed by SHA-256 archive hash. Tracks per-hash voter sets, enforces the 50% threshold before triggering evaluation, prevents duplicate votes, caps pending entries at 100, and runs a 30-second TTL reaper for expired entries (default 60s TTL). A minimal sketch of both new types follows this list.
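
For orientation, here is a minimal sketch of the two new types as described above. Field and method names mirror this summary (hotkeys, is_whitelisted, validator_count, PendingConsensus, max_pending); the exact signatures in src/validator_whitelist.rs and src/consensus.rs may differ.

```rust
use std::collections::HashSet;
use std::time::Instant;

use dashmap::DashMap;
use parking_lot::RwLock;

/// Sketch of the dynamic whitelist: a set of SS58 hotkeys behind an RwLock,
/// replaced wholesale by the background refresh loop.
pub struct ValidatorWhitelist {
    hotkeys: RwLock<HashSet<String>>,
}

impl ValidatorWhitelist {
    pub fn new() -> Self {
        Self { hotkeys: RwLock::new(HashSet::new()) }
    }

    /// Used by auth::verify_request to accept or reject a caller's hotkey.
    pub fn is_whitelisted(&self, hotkey: &str) -> bool {
        self.hotkeys.read().contains(hotkey)
    }

    pub fn validator_count(&self) -> usize {
        self.hotkeys.read().len()
    }
}

/// Sketch of one pending entry, keyed externally by the SHA-256 hex hash.
struct PendingConsensus {
    archive_data: Vec<u8>,
    voters: HashSet<String>,
    created_at: Instant,
    concurrent_tasks: Option<usize>,
}

/// Sketch of the consensus tracker; the real record_vote()/reaper_loop()
/// live in src/consensus.rs.
pub struct ConsensusManager {
    pending: DashMap<String, PendingConsensus>,
    max_pending: usize,
}
```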

Modified modules

  • src/auth.rs — Removed AUTHORIZED_HOTKEY import, X-Api-Key header extraction, InvalidApiKey error variant, and API key comparison. verify_request() now accepts a &ValidatorWhitelist reference and checks hotkey membership in the dynamic whitelist. AuthHeaders reduced from 4 fields to 3.
  • src/config.rs — Removed AUTHORIZED_HOTKEY constant and worker_api_key field (no longer requires WORKER_API_KEY env var). Added configurable fields: bittensor_netuid, min_validator_stake_tao, validator_refresh_secs, consensus_threshold, consensus_ttl_secs with sensible defaults and env var overrides.
  • src/handlers.rs — AppState extended with ValidatorWhitelist and ConsensusManager. The submit_batch handler now rejects requests when the whitelist is empty (503), computes the SHA-256 of the archive, records consensus votes, and returns 202 with pending status or triggers evaluation once the threshold is reached (the branching on the vote result is sketched after this list).
  • src/main.rs — Added module declarations, creates and spawns whitelist refresh loop and consensus reaper as background tasks, wires new state into AppState.
  • Cargo.toml — Added bittensor-rs git dependency and required [patch.crates-io] for w3f-bls.
  • Dockerfile — Added protobuf-compiler, cmake, clang build dependencies and .cargo config copy for bittensor-rs compilation.
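
As referenced in the src/handlers.rs bullet, the handler's branching on the vote result looks roughly like the sketch below. The ConsensusStatus variants mirror the ones in this PR; the standalone helper and the JSON field names are illustrative assumptions, not the actual handler code.

```rust
use serde_json::{json, Value};

// The variants mirror the statuses described in this PR; the field names
// inside the JSON bodies are assumptions.
pub enum ConsensusStatus {
    Pending { votes: usize, required: usize, total_validators: usize },
    AlreadyVoted { votes: usize, required: usize },
    Reached { archive_data: Vec<u8>, concurrent_tasks: Option<usize>, votes: usize, required: usize },
}

/// Map a record_vote() outcome to an HTTP status code and JSON body,
/// roughly the way submit_batch responds (202 in every accepted case).
pub fn consensus_response(status: ConsensusStatus) -> (u16, Value) {
    match status {
        ConsensusStatus::Pending { votes, required, total_validators } => (
            202,
            json!({ "status": "pending_consensus", "votes": votes,
                    "required": required, "total_validators": total_validators }),
        ),
        ConsensusStatus::AlreadyVoted { votes, required } => (
            202,
            json!({ "status": "pending_consensus", "already_voted": true,
                    "votes": votes, "required": required }),
        ),
        ConsensusStatus::Reached { archive_data, votes, required, .. } => {
            // In the real handler this is where the archive is extracted,
            // the batch is created, and the executor task is spawned.
            let _ = archive_data;
            (202, json!({ "status": "consensus_reached", "votes": votes, "required": required }))
        }
    }
}
```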

Documentation

  • AGENTS.md and src/AGENTS.md — Updated data flow, module map, shared state descriptions, environment variables, and authentication documentation.

Breaking Changes

  • X-Api-Key header is no longer accepted or required
  • WORKER_API_KEY environment variable is no longer needed
  • Single AUTHORIZED_HOTKEY replaced by dynamic whitelist — only Bittensor netuid 100 validators with ≥10k TAO stake are authorized
  • POST /submit now returns 202 with pending_consensus status until ≥50% of validators submit the same archive

Summary by CodeRabbit

Release Notes

  • New Features

    • Batch submissions now support consensus-based validation with multi-validator voting
    • Dynamic validator whitelist with automatic periodic refresh
    • API responses include consensus status, vote counts, and validation thresholds
  • Refactor

    • Authentication mechanism changed from static API keys to dynamic validator whitelist verification
    • Configuration expanded with Bittensor network parameters and consensus settings
    • Background maintenance tasks added for whitelist updates and consensus cleanup

feat(auth): replace static hotkey/API-key auth with Bittensor validator whitelisting and 50% consensus

Integrate dynamic validator whitelisting from Bittensor netuid 100 and
consensus-based evaluation triggering, replacing the previous single
AUTHORIZED_HOTKEY + WORKER_API_KEY authentication system.

Authentication now uses a dynamic whitelist of validators fetched every
5 minutes from the Bittensor blockchain via bittensor-rs. Validators
must have validator_permit, be active, and have >=10,000 TAO stake.
POST /submit requests only trigger evaluations when >=50% of whitelisted
validators submit the same archive payload (identified by SHA-256 hash).

New modules:
- src/validator_whitelist.rs: ValidatorWhitelist with parking_lot::RwLock,
  background refresh loop with 3-retry exponential backoff, connection
  resilience (keeps cached whitelist on failure), starts empty and rejects
  requests with 503 until first successful sync
- src/consensus.rs: ConsensusManager using DashMap for lock-free vote
  tracking, PendingConsensus entries with TTL (default 60s), reaper loop
  every 30s, max 100 pending entries cap, duplicate vote detection

Modified modules:
- src/auth.rs: Removed AUTHORIZED_HOTKEY import, api_key field from
  AuthHeaders, X-Api-Key header extraction, InvalidApiKey error variant.
  verify_request() now takes &ValidatorWhitelist instead of API key string.
  Updated all tests accordingly.
- src/config.rs: Removed AUTHORIZED_HOTKEY constant and worker_api_key
  field. Added bittensor_netuid, min_validator_stake_tao,
  validator_refresh_secs, consensus_threshold, consensus_ttl_secs with
  env var support and sensible defaults. Updated banner output.
- src/handlers.rs: Added ValidatorWhitelist and ConsensusManager to
  AppState. submit_batch now: checks whitelist non-empty (503), validates
  against whitelist, computes SHA-256 of archive, records consensus vote,
  returns 202 with pending status or triggers evaluation on consensus.
  Moved active batch check to consensus-reached branch only.
- src/main.rs: Added module declarations, creates ValidatorWhitelist and
  ConsensusManager, spawns background refresh and reaper tasks.
- Cargo.toml: Added bittensor-rs git dependency and mandatory
  [patch.crates-io] for w3f-bls.
- Dockerfile: Added protobuf-compiler, cmake, clang, mold build deps
  for bittensor-rs substrate dependencies. Copies .cargo config.
- AGENTS.md and src/AGENTS.md: Updated data flow, module map, env vars,
  authentication docs to reflect new architecture.

BREAKING CHANGE: WORKER_API_KEY env var and X-Api-Key header no longer required.
All validators on Bittensor netuid 100 with sufficient stake are auto-whitelisted.

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough


This pull request introduces a dynamic validator whitelist and a consensus-based voting mechanism to replace fixed API-key authentication. New modules ValidatorWhitelist and ConsensusManager handle validator synchronization and consensus tracking, respectively, and are wired into background refresh and cleanup loops along with updated configuration parameters.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Consensus & Whitelist Mechanisms — src/consensus.rs, src/validator_whitelist.rs | New modules implementing in-memory consensus voting (DashMap-backed) with TTL reaper, and dynamic validator whitelist synced from Bittensor network with periodic refresh loop and exponential backoff. |
| Authentication & Authorization — src/auth.rs | Removed fixed API key handling (field, parsing, error variant); replaced with whitelist-based verification checking against ValidatorWhitelist instead of static hotkey comparison. |
| Configuration & Startup — src/config.rs, src/main.rs | Replaced AUTHORIZED_HOTKEY constant with Bittensor and consensus config fields (netuid, stake threshold, refresh/TTL intervals, consensus threshold); main now initializes and spawns validator whitelist and consensus manager background tasks alongside AppState. |
| Request Handling & State — src/handlers.rs | Updated submit_batch to enforce whitelist readiness, compute archive hash, perform consensus voting, and conditionally proceed to batch creation and execution; expanded AppState with validator_whitelist and consensus_manager fields. |
| Dependencies & Build — Cargo.toml, Dockerfile | Added bittensor-rs Git dependency and w3f-bls patch; extended Dockerfile build stage with protobuf-compiler, cmake, clang, mold, and Cargo vendor directory copy. |
| Documentation — AGENTS.md, src/AGENTS.md | Updated architecture documentation to reflect validator-consensus workflow, whitelist-based authentication, new state and background components, and expanded configuration surface. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Handler
    participant AuthFlow
    participant Whitelist
    participant ConsensusManager
    participant Archive
    participant Executor

    Client->>Handler: POST /submit_batch (archive + auth)
    Handler->>AuthFlow: extract_auth_headers(request)
    AuthFlow-->>Handler: AuthHeaders (hotkey, nonce, sig)
    
    Handler->>Whitelist: Check if whitelist initialized
    alt Whitelist not ready
        Handler-->>Client: 503 Whitelist Not Ready
    else Whitelist ready
        Handler->>AuthFlow: verify_request(auth, nonce_store, whitelist)
        AuthFlow->>Whitelist: is_whitelisted(hotkey)
        Whitelist-->>AuthFlow: true/false
        alt Invalid auth
            AuthFlow-->>Handler: AuthError
            Handler-->>Client: 401 Unauthorized
        else Valid auth
            Handler->>Handler: Compute SHA-256(archive)
            Handler->>ConsensusManager: record_vote(hash, hotkey, archive, required, total)
            
            alt Vote already cast by hotkey
                ConsensusManager-->>Handler: AlreadyVoted {votes, required}
                Handler-->>Client: 202 Accepted (pending)
            else Votes not yet reached
                ConsensusManager-->>Handler: Pending {votes, required}
                Handler-->>Client: 202 Accepted (pending)
            else Consensus reached
                ConsensusManager-->>Handler: Reached {archive_data, ...}
                Handler->>Archive: Extract archive data
                Archive-->>Handler: Extracted content
                Handler->>Executor: Spawn batch task
                Executor->>Executor: Process batch
                Handler-->>Client: 202 Accepted (consensus_reached)
            end
        end
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The review requires careful analysis of new consensus voting semantics (DashMap concurrency, TTL logic, voting idempotency), validator whitelist Bittensor integration (metagraph sync, retry logic), authentication flow refactoring, submit_batch state machine (multiple consensus outcomes), and orchestration of background tasks. While changes follow consistent patterns across files, the heterogeneity of new logic (consensus, whitelist, auth, handlers, config) and interdependencies between modules demand separate reasoning for each component.

Poem

🐰 Validators gather 'round, their votes now sing,
A whitelist fresh from Bittensor's blockchain ring,
No fixed key chains us down, consensus reigns supreme,
From pending votes to "Reached"—a distributed dream!
The archive hashed, the batch at last takes flight, 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped — CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: replacing static authentication with dynamic Bittensor validator whitelisting and consensus voting, which aligns with all significant modifications across multiple modules. |

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (8)
src/main.rs (1)

42-42: Hardcoded max_pending = 100 — consider making it configurable.

Other consensus/whitelist parameters are already configurable via Config and environment variables. The max_pending cap for ConsensusManager is hardcoded here, making it the odd one out. Consider adding a consensus_max_pending field to Config for consistency.

♻️ Suggested change (main.rs)
```diff
-    let consensus_manager = consensus::ConsensusManager::new(100);
+    let consensus_manager = consensus::ConsensusManager::new(config.consensus_max_pending);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/main.rs` at line 42, The ConsensusManager is being constructed with a
hardcoded max_pending (ConsensusManager::new(100)); make this value configurable
by adding a consensus_max_pending field to the existing Config and wiring it
through where ConsensusManager is created: add consensus_max_pending to Config,
load it from the same env/config source used by other fields, and replace the
literal 100 in the call to ConsensusManager::new(...) with
config.consensus_max_pending so the cap is consistent with other configurable
consensus/whitelist parameters.
src/auth.rs (2)

72-95: Whitelist check before SS58 validation — good security posture, but note the pre-existing TOCTOU in NonceStore.

The ordering (whitelist → SS58 → nonce → signature) is correct: it avoids leaking format-validity information to unauthorized callers and fails fast.

However, NonceStore::check_and_insert (lines 22-27) has a pre-existing TOCTOU race — contains_key + insert on DashMap is not atomic. Two concurrent requests with the same nonce could both pass the check. Consider using DashMap::entry for atomic insert-if-absent in a follow-up.
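
A minimal sketch of the suggested entry-based fix, assuming NonceStore wraps a DashMap<String, Instant> (the real key/value types in src/auth.rs may differ):

```rust
use std::time::Instant;

use dashmap::mapref::entry::Entry;
use dashmap::DashMap;

// Hypothetical NonceStore shape for illustration; the real field and value
// types in src/auth.rs may differ.
pub struct NonceStore {
    seen: DashMap<String, Instant>,
}

impl NonceStore {
    /// Returns true only for the first caller presenting this nonce.
    /// DashMap::entry holds the shard lock across the check and the insert,
    /// so two concurrent requests with the same nonce cannot both pass.
    pub fn check_and_insert(&self, nonce: &str) -> bool {
        match self.seen.entry(nonce.to_string()) {
            Entry::Occupied(_) => false, // replay: nonce already recorded
            Entry::Vacant(slot) => {
                slot.insert(Instant::now()); // first use: record and accept
                true
            }
        }
    }
}
```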

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/auth.rs` around lines 72 - 95, NonceStore::check_and_insert currently
does a non-atomic contains_key + insert on the DashMap which creates a TOCTOU
race allowing two concurrent requests with the same nonce to both succeed;
change the implementation of NonceStore::check_and_insert to use DashMap::entry
(or equivalent atomic “insert-if-absent” API) so the check-and-insert is
performed atomically and returns false if the entry already exists, ensuring
only the first inserter succeeds.

254-265: Missing test: verify_request with a whitelisted hotkey.

The test covers the rejection path (non-whitelisted hotkey), but there's no test that exercises the full happy path of verify_request with a hotkey that IS in the whitelist (which would then proceed to SS58 validation, nonce check, and signature verification). Consider adding a roundtrip test that inserts a key into the whitelist and calls verify_request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/auth.rs` around lines 254 - 265, Add a positive unit test that exercises
verify_request with a whitelisted hotkey: create a NonceStore and a
ValidatorWhitelist, insert a valid SS58 hotkey into the whitelist (using the
same format the code expects), create an AuthHeaders with that hotkey, a
matching nonce present in NonceStore, and a signature that will pass
verification (either by generating a real signature with the project’s signing
helper or by using the existing test helper/mocking used elsewhere); call
verify_request(&auth, &store, &wl) and assert it returns Ok (and any expected
AuthResult fields). Ensure the test name reflects the happy path (e.g.,
test_verify_request_whitelisted) and reuse symbols verify_request,
ValidatorWhitelist, NonceStore, and AuthHeaders so the test mirrors the
rejection test structure.
src/validator_whitelist.rs (1)

86-120: Tests cover sync methods well; consider adding a test for refresh_once failure path.

The existing unit tests validate the basic data structure operations. Since try_refresh depends on external RPC, a full integration test isn't expected here, but you could add a test that directly writes to hotkeys and then verifies that a failed refresh_once preserves the cached whitelist (the "keeps cached whitelist on failure" behavior documented in AGENTS.md).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/validator_whitelist.rs` around lines 86 - 120, Add a unit test that
ensures ValidatorWhitelist keeps its cached hotkeys when refresh_once fails:
create a ValidatorWhitelist via ValidatorWhitelist::new(), populate
wl.hotkeys.write().insert(...) with a known hotkey, invoke wl.refresh_once(...)
in a way that causes it to return an error (simulate RPC failure or call the
method path that returns Err), assert the call returns Err, and finally assert
wl.validator_count() and wl.is_whitelisted(...) still reflect the original
cached entry; reference ValidatorWhitelist::refresh_once,
ValidatorWhitelist::new, wl.hotkeys, validator_count, and is_whitelisted to
locate the code to test.
Cargo.toml (1)

65-72: Pin git dependencies to a specific rev for reproducible builds.

Both bittensor-rs and the w3f-bls patch reference a branch (main / fix-no-std). While Cargo.lock captures the resolved commit, any cargo update will silently pull in whatever HEAD is at that moment, risking breakage or supply-chain issues. Pin to an explicit commit:

♻️ Suggested change
```diff
-bittensor-rs = { git = "https://github.com/cortexlm/bittensor-rs", branch = "main" }
+bittensor-rs = { git = "https://github.com/cortexlm/bittensor-rs", rev = "<pin-commit-sha>" }

 [patch.crates-io]
-w3f-bls = { git = "https://github.com/opentensor/bls", branch = "fix-no-std" }
+w3f-bls = { git = "https://github.com/opentensor/bls", rev = "<pin-commit-sha>" }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Cargo.toml` around lines 65 - 72, The git dependencies are pinned to branches
which risks non-reproducible builds; update the Cargo.toml entries for the
bittensor-rs dependency and the w3f-bls patch to use explicit commit SHAs by
replacing the branch = "..." keys with rev = "<commit-sha>" for the specific
commits you want to lock to (keep the git URLs and any other keys intact),
ensuring both the bittensor-rs and w3f-bls entries reference concrete commit
SHAs rather than branches.
src/consensus.rs (3)

104-106: is_at_capacity is checked before record_vote in the handler, but there's no atomic guard.

If is_at_capacity() returns false and multiple threads pass this check simultaneously, they could all insert, exceeding max_pending. This is a soft limit and the overshoot is bounded by the number of concurrent requests, so it's acceptable as a best-effort guard. Just noting it isn't a strict cap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/consensus.rs` around lines 104 - 106, is_at_capacity currently only reads
self.pending.len() so concurrent handlers can all observe false and then call
record_vote, causing bounded overshoot; make the capacity check and insertion
atomic by moving the check into record_vote (or protecting both with the same
lock) so you verify pending.len() < max_pending while holding the mutex that
guards pending (or use the map's entry API to conditionally insert only if
vacant and size still < max_pending). Update record_vote to perform the length
check under the pending lock (or use pending.entry/vacant check) and return a
clear capacity error if the insertion would exceed max_pending; keep
is_at_capacity as a best-effort helper if needed but do not rely on it for
correctness.

133-231: Tests cover the core scenarios well, but consider adding a concurrent voting test.

The existing tests validate single vote, threshold, duplicates, independent hashes, TTL, and capacity. Given the race condition identified above, an additional multi-threaded test using std::thread::scope or tokio::spawn with multiple voters racing on the same hash would help validate the fix.

Also, per coding guidelines, async tests should use #[tokio::test] — the reaper_loop is async but not tested here. A test that inserts an expired entry, advances time (e.g., via tokio::time::pause()), and verifies reaping would strengthen coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/consensus.rs` around lines 133 - 231, Add a concurrent voting test that
races multiple callers of ConsensusManager::record_vote on the same hash to
ensure no double-counting (use std::thread::scope or tokio::spawn with multiple
voters and assert final ConsensusStatus or pending_count), and add an async
#[tokio::test] that exercises reaper_loop by inserting an expired entry into
ConsensusManager::pending, calling tokio::time::pause()/advance to simulate time
passage, running reaper_loop (or invoking the reap logic) and asserting the
expired entry is removed; reference ConsensusManager, record_vote, reaper_loop,
pending, pending_count, and is_at_capacity when implementing these tests.

7-12: PendingConsensus stores a full copy of the archive per hash entry.

This is fine for correctness since only the first voter's data is kept (subsequent voters' archive_data is dropped by or_insert_with). However, note that every pending consensus entry holds the full archive bytes in memory. With max_pending = 100 and potentially large archives, this could consume significant memory. Consider whether storing only the hash and retrieving the archive from the voter that triggers consensus would be more memory-efficient — though that would be a larger design change and is acceptable for now.
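
A minimal sketch of the Arc<Vec<u8>> variant, with the struct shape assumed from this review (only archive_data changes; the or_insert_with site then wraps the bytes once and clones the Arc thereafter):

```rust
use std::collections::HashSet;
use std::sync::Arc;
use std::time::Instant;

// Sketch: share the archive bytes behind an Arc so holding many pending
// entries (or handing the data to the consensus-reached path) never copies
// the payload. Field names are assumed from this review.
struct PendingConsensus {
    archive_data: Arc<Vec<u8>>, // was Vec<u8>; clones are now pointer-sized
    voters: HashSet<String>,
    created_at: Instant,
    concurrent_tasks: Option<usize>,
}

// At the or_insert_with site, wrap the bytes once and clone the Arc after:
//
//   self.pending.entry(hash).or_insert_with(|| PendingConsensus {
//       archive_data: Arc::new(archive_bytes),
//       ..
//   });
```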

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/consensus.rs` around lines 7 - 12, PendingConsensus currently stores a
full Vec<u8> per entry (archive_data), which can blow up memory; replace
archive_data with a shared pointer (e.g., Arc<Vec<u8>>) and ensure wherever a
PendingConsensus is created (the or_insert_with site and any constructors) you
wrap the archive bytes in Arc::new(...) and clone the Arc when inserting so
multiple entries share the same buffer instead of copying it; alternatively, if
you prefer the larger design change, store only the archive hash in
PendingConsensus and fetch the full archive from the voter that triggers
consensus, but for a minimal change use Arc<Vec<u8>> and update all uses of
archive_data accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AGENTS.md`:
- Around line 37-47: The fenced code block describing ValidatorWhitelist refresh
loop and ConsensusManager reaper loop lacks a language specifier (MD040); update
the block to include a plain-text language tag (e.g., use ```text) so the static
analyzer stops flagging it; modify the section containing "ValidatorWhitelist
refresh loop (every 5 minutes):" and "ConsensusManager reaper loop (every 30
seconds):" (references: ValidatorWhitelist, ConsensusManager,
BittensorClient::with_failover, metagraph) to open the fence with ```text and
close as before.

In `@src/AGENTS.md`:
- Around line 54-60: The doc line about duplicate votes is slightly misleading;
update AGENTS.md (consensus.rs section) to state that duplicate votes are not
silently ignored but cause record_vote() to return ConsensusStatus::AlreadyVoted
along with current vote counts, so callers can surface that to clients; mention
ConsensusManager and record_vote() by name and the ConsensusStatus::AlreadyVoted
variant to make the behavior explicit.
- Line 67: Update the submit_batch handler documentation to include the busy
check: modify the flow on line 67 to show auth header extraction → whitelist
empty check (503) → busy check (503) — call to has_active_batch() — →
verify_request (whitelist + nonce + signature) → multipart upload → SHA-256 hash
→ consensus vote → if pending: return 202 with vote count → if reached: archive
extraction → batch creation → executor spawn; reference the submit_batch handler
and the has_active_batch() check so the doc matches runtime behavior.

In `@src/config.rs`:
- Around line 13-17: The CONSENSUS_THRESHOLD value (parsed as f64) has no bounds
check; update the from_env logic that reads CONSENSUS_THRESHOLD so it validates
the parsed value is within a safe range (e.g., > 0.0 and <= 1.0) and handle
out-of-range values by either clamping to DEFAULT_CONSENSUS_THRESHOLD or
returning a clear error/result and logging a warning; reference
DEFAULT_CONSENSUS_THRESHOLD and the CONSENSUS_THRESHOLD env parse in from_env to
locate and change the code.

In `@src/consensus.rs`:
- Around line 79-96: The race comes from dropping the DashMap RefMut (`entry`)
before calling `self.pending.remove(archive_hash)`, allowing another thread to
remove the same key and leaving this thread to return Pending despite consensus
being reached; fix by capturing the map key via `entry.key().clone()` before
dropping, then perform an atomic conditional remove using DashMap's conditional
removal API (e.g., `remove_if`) that validates the stored vote count/consensus
threshold inside the closure (check `votes >= required`) so removal only
succeeds when the map's current state still meets the consensus condition;
update the code paths around `entry`, `self.pending.remove` and the
consensus-return branch (functions/idents: `entry`, `entry.key()`,
`self.pending`, `remove_if`, `ConsensusStatus::Reached`/`Pending`) to use this
pattern.

In `@src/handlers.rs`:
- Around line 189-193: The current synchronous SHA-256 computation using Sha256
on archive_bytes (producing archive_hash) is CPU-bound and runs on the Tokio
worker thread; move that work into a blocking task by using
tokio::task::spawn_blocking to compute the hash so it doesn't block the async
runtime, and avoid clone() by moving archive_bytes into the spawn_blocking
closure and returning both the computed hex hash and the archive_bytes for later
use in record_vote; update the code that awaits spawn_blocking to receive
(archive_hash, archive_bytes) and use those values downstream.
- Around line 243-261: The handler currently calls record_vote() which can
return ConsensusStatus::Reached and remove the pending consensus entry before
checking state.sessions.has_active_batch(), causing consumed votes if the server
is busy; change the flow in the handler to check has_active_batch() first and
return the 503 without touching consensus, or if you must call record_vote()
before the busy-check, capture and persist the consensus state (the data inside
ConsensusStatus::Reached — e.g., archive_data, concurrent_tasks, votes,
required) before record_vote() removes it so you can re-insert it into the
pending map on the busy path; update the code paths around record_vote(),
ConsensusStatus::Reached, and the pending map manipulation in consensus.rs to
ensure validators can re-submit and re-reach consensus when the handler returns
SERVICE_UNAVAILABLE.

In `@src/validator_whitelist.rs`:
- Around line 62-83: The try_refresh function calls
BittensorClient::with_failover() and bittensor_rs::sync_metagraph() without
timeouts; wrap these async calls with tokio::time::timeout (e.g., a configurable
Duration) so a hanging RPC will return an error instead of blocking the refresh
loop, and map timeout errors into the existing anyhow::Error flow (use
anyhow::anyhow!("Timed out ...") or similar). Specifically, apply timeout to the
with_failover() await and to the sync_metagraph(&client, netuid).await calls,
handle the Result from timeout (map Err(elapsed) -> anyhow error), and keep the
rest of try_refresh (new_hotkeys population and hotkeys write) unchanged.
- Around line 62-77: The dependency for bittensor-rs is currently tracked by
branch="main" which can introduce breaking API changes affecting
BittensorClient::with_failover, sync_metagraph, and neuron field access; update
Cargo.toml to replace the unpinned branch with a specific commit revision (rev =
"<commit-hash>", e.g. eb58916af5a4d7fef74ef00ea0d61519880b101f) so builds are
reproducible and APIs used in try_refresh (BittensorClient::with_failover,
sync_metagraph, neuron.validator_permit, neuron.active, neuron.stake) remain
stable. Ensure you run cargo update -p bittensor-rs if needed and check
Cargo.lock reflects the pinned rev.

---

Nitpick comments:
In `@Cargo.toml`:
- Around line 65-72: The git dependencies are pinned to branches which risks
non-reproducible builds; update the Cargo.toml entries for the bittensor-rs
dependency and the w3f-bls patch to use explicit commit SHAs by replacing the
branch = "..." keys with rev = "<commit-sha>" for the specific commits you want
to lock to (keep the git URLs and any other keys intact), ensuring both the
bittensor-rs and w3f-bls entries reference concrete commit SHAs rather than
branches.

In `@src/auth.rs`:
- Around line 72-95: NonceStore::check_and_insert currently does a non-atomic
contains_key + insert on the DashMap which creates a TOCTOU race allowing two
concurrent requests with the same nonce to both succeed; change the
implementation of NonceStore::check_and_insert to use DashMap::entry (or
equivalent atomic “insert-if-absent” API) so the check-and-insert is performed
atomically and returns false if the entry already exists, ensuring only the
first inserter succeeds.
- Around line 254-265: Add a positive unit test that exercises verify_request
with a whitelisted hotkey: create a NonceStore and a ValidatorWhitelist, insert
a valid SS58 hotkey into the whitelist (using the same format the code expects),
create an AuthHeaders with that hotkey, a matching nonce present in NonceStore,
and a signature that will pass verification (either by generating a real
signature with the project’s signing helper or by using the existing test
helper/mocking used elsewhere); call verify_request(&auth, &store, &wl) and
assert it returns Ok (and any expected AuthResult fields). Ensure the test name
reflects the happy path (e.g., test_verify_request_whitelisted) and reuse
symbols verify_request, ValidatorWhitelist, NonceStore, and AuthHeaders so the
test mirrors the rejection test structure.

In `@src/consensus.rs`:
- Around line 104-106: is_at_capacity currently only reads self.pending.len() so
concurrent handlers can all observe false and then call record_vote, causing
bounded overshoot; make the capacity check and insertion atomic by moving the
check into record_vote (or protecting both with the same lock) so you verify
pending.len() < max_pending while holding the mutex that guards pending (or use
the map's entry API to conditionally insert only if vacant and size still <
max_pending). Update record_vote to perform the length check under the pending
lock (or use pending.entry/vacant check) and return a clear capacity error if
the insertion would exceed max_pending; keep is_at_capacity as a best-effort
helper if needed but do not rely on it for correctness.
- Around line 133-231: Add a concurrent voting test that races multiple callers
of ConsensusManager::record_vote on the same hash to ensure no double-counting
(use std::thread::scope or tokio::spawn with multiple voters and assert final
ConsensusStatus or pending_count), and add an async #[tokio::test] that
exercises reaper_loop by inserting an expired entry into
ConsensusManager::pending, calling tokio::time::pause()/advance to simulate time
passage, running reaper_loop (or invoking the reap logic) and asserting the
expired entry is removed; reference ConsensusManager, record_vote, reaper_loop,
pending, pending_count, and is_at_capacity when implementing these tests.
- Around line 7-12: PendingConsensus currently stores a full Vec<u8> per entry
(archive_data), which can blow up memory; replace archive_data with a shared
pointer (e.g., Arc<Vec<u8>>) and ensure wherever a PendingConsensus is created
(the or_insert_with site and any constructors) you wrap the archive bytes in
Arc::new(...) and clone the Arc when inserting so multiple entries share the
same buffer instead of copying it; alternatively, if you prefer the larger
design change, store only the archive hash in PendingConsensus and fetch the
full archive from the voter that triggers consensus, but for a minimal change
use Arc<Vec<u8>> and update all uses of archive_data accordingly.

In `@src/main.rs`:
- Line 42: The ConsensusManager is being constructed with a hardcoded
max_pending (ConsensusManager::new(100)); make this value configurable by adding
a consensus_max_pending field to the existing Config and wiring it through where
ConsensusManager is created: add consensus_max_pending to Config, load it from
the same env/config source used by other fields, and replace the literal 100 in
the call to ConsensusManager::new(...) with config.consensus_max_pending so the
cap is consistent with other configurable consensus/whitelist parameters.

In `@src/validator_whitelist.rs`:
- Around line 86-120: Add a unit test that ensures ValidatorWhitelist keeps its
cached hotkeys when refresh_once fails: create a ValidatorWhitelist via
ValidatorWhitelist::new(), populate wl.hotkeys.write().insert(...) with a known
hotkey, invoke wl.refresh_once(...) in a way that causes it to return an error
(simulate RPC failure or call the method path that returns Err), assert the call
returns Err, and finally assert wl.validator_count() and wl.is_whitelisted(...)
still reflect the original cached entry; reference
ValidatorWhitelist::refresh_once, ValidatorWhitelist::new, wl.hotkeys,
validator_count, and is_whitelisted to locate the code to test.

Comment on lines +37 to 47
```
ValidatorWhitelist refresh loop (every 5 minutes):
1. Connect to Bittensor subtensor via BittensorClient::with_failover()
2. Sync metagraph for netuid 100
3. Filter validators: validator_permit && active && stake >= 10,000 TAO
4. Atomically replace whitelist with new set of SS58 hotkeys
5. On failure: retry up to 3 times with exponential backoff, keep cached whitelist

ConsensusManager reaper loop (every 30 seconds):
1. Remove pending consensus entries older than TTL (default 60s)
```

⚠️ Potential issue | 🟡 Minor

Add language specifier to fenced code block.

The static analysis tool flags this code block as missing a language identifier (MD040). Since it's a plain-text description rather than code, use text or plaintext:

♻️ Suggested fix
-```
+```text
 ValidatorWhitelist refresh loop (every 5 minutes):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
ValidatorWhitelist refresh loop (every 5 minutes):
1. Connect to Bittensor subtensor via BittensorClient::with_failover()
2. Sync metagraph for netuid 100
3. Filter validators: validator_permit && active && stake >= 10,000 TAO
4. Atomically replace whitelist with new set of SS58 hotkeys
5. On failure: retry up to 3 times with exponential backoff, keep cached whitelist
ConsensusManager reaper loop (every 30 seconds):
1. Remove pending consensus entries older than TTL (default 60s)
```
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 37-37: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` around lines 37 - 47, The fenced code block describing
ValidatorWhitelist refresh loop and ConsensusManager reaper loop lacks a
language specifier (MD040); update the block to include a plain-text language
tag (e.g., use ```text) so the static analyzer stops flagging it; modify the
section containing "ValidatorWhitelist refresh loop (every 5 minutes):" and
"ConsensusManager reaper loop (every 30 seconds):" (references:
ValidatorWhitelist, ConsensusManager, BittensorClient::with_failover, metagraph)
to open the fence with ```text and close as before.

Comment on lines +54 to +60
### `consensus.rs`
- `ConsensusManager` — `DashMap<String, PendingConsensus>` keyed by SHA-256 hex hash of archive bytes.
- `PendingConsensus` — holds archive data, voter set, creation time, concurrent_tasks setting.
- `record_vote()` — adds a validator's vote for an archive hash; returns `ConsensusStatus` (Pending, Reached, AlreadyVoted).
- `is_at_capacity()` — checks if max pending entries reached (prevents memory exhaustion).
- `reaper_loop(ttl_secs)` — background task that removes expired entries every 30 seconds.
- **Convention**: Consensus entries have a 60-second TTL. Max 100 pending entries. Duplicate votes from the same validator are silently acknowledged.

⚠️ Potential issue | 🟡 Minor

Documentation for consensus.rs accurately reflects the implementation.

One minor note: Line 60 states "Duplicate votes from the same validator are silently acknowledged" — this is slightly imprecise. The code returns ConsensusStatus::AlreadyVoted with vote counts, which the handler presumably surfaces to the caller rather than silently ignoring. Consider rewording to "Duplicate votes return an AlreadyVoted status with current vote counts."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/AGENTS.md` around lines 54 - 60, The doc line about duplicate votes is
slightly misleading; update AGENTS.md (consensus.rs section) to state that
duplicate votes are not silently ignored but cause record_vote() to return
ConsensusStatus::AlreadyVoted along with current vote counts, so callers can
surface that to clients; mention ConsensusManager and record_vote() by name and
the ConsensusStatus::AlreadyVoted variant to make the behavior explicit.

- Route handlers: `health`, `status`, `metrics`, `submit_batch`, `get_batch`, `get_batch_tasks`, `get_task`, `list_batches`.
- Routes: `GET /health`, `GET /status`, `GET /metrics`, `POST /submit`, `GET /batch/{id}`, `GET /batch/{id}/tasks`, `GET /batch/{id}/task/{task_id}`, `GET /batches`, `GET /ws`.
- `submit_batch` handler does: auth header extraction → `verify_request` (hotkey + API key + nonce + signature) → busy check → multipart upload → archive extraction → batch creation → executor spawn.
- `submit_batch` handler does: auth header extraction → whitelist empty check (503) → `verify_request` (whitelist + nonce + signature) → multipart upload → SHA-256 hash → consensus vote → if pending: return 202 with vote count → if reached: archive extraction → batch creation → executor spawn.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n 'has_active_batch' src/handlers.rs

Repository: PlatformNetwork/term-executor

Length of output: 222


🏁 Script executed:

sed -n '60,85p' src/handlers.rs

Repository: PlatformNetwork/term-executor

Length of output: 1176


🏁 Script executed:

sed -n '250,260p' src/handlers.rs

Repository: PlatformNetwork/term-executor

Length of output: 546


Add the busy check to the documented submit_batch flow on line 67.

The has_active_batch() check is still enforced in the handler (returning 503 if a batch is running). The documented flow should include this step: auth header extraction → whitelist empty check (503) → busy check (503) → verify_request (whitelist + nonce + signature) → multipart upload → ... The busy check occurs before multipart processing and prevents concurrent batch submissions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/AGENTS.md` at line 67, Update the submit_batch handler documentation to
include the busy check: modify the flow on line 67 to show auth header
extraction → whitelist empty check (503) → busy check (503) — call to
has_active_batch() — → verify_request (whitelist + nonce + signature) →
multipart upload → SHA-256 hash → consensus vote → if pending: return 202 with
vote count → if reached: archive extraction → batch creation → executor spawn;
reference the submit_batch handler and the has_active_batch() check so the doc
matches runtime behavior.

Comment on lines +13 to +17
const DEFAULT_BITTENSOR_NETUID: u16 = 100;
const DEFAULT_MIN_VALIDATOR_STAKE_TAO: f64 = 10_000.0;
const DEFAULT_VALIDATOR_REFRESH_SECS: u64 = 300;
const DEFAULT_CONSENSUS_THRESHOLD: f64 = 0.5;
const DEFAULT_CONSENSUS_TTL_SECS: u64 = 60;

⚠️ Potential issue | 🟡 Minor

No bounds validation on consensus_threshold.

CONSENSUS_THRESHOLD is parsed as f64 with no range check. A value of 0.0 makes every single vote trigger consensus immediately (security bypass), and a value > 1.0 makes consensus permanently unreachable. Consider clamping or validating in from_env:

🛡️ Suggested validation
```diff
+            consensus_threshold: {
+                let t = env_parse("CONSENSUS_THRESHOLD", DEFAULT_CONSENSUS_THRESHOLD);
+                t.clamp(0.01, 1.0)
+            },
-            consensus_threshold: env_parse("CONSENSUS_THRESHOLD", DEFAULT_CONSENSUS_THRESHOLD),
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config.rs` around lines 13 - 17, The CONSENSUS_THRESHOLD value (parsed as
f64) has no bounds check; update the from_env logic that reads
CONSENSUS_THRESHOLD so it validates the parsed value is within a safe range
(e.g., > 0.0 and <= 1.0) and handle out-of-range values by either clamping to
DEFAULT_CONSENSUS_THRESHOLD or returning a clear error/result and logging a
warning; reference DEFAULT_CONSENSUS_THRESHOLD and the CONSENSUS_THRESHOLD env
parse in from_env to locate and change the code.

Comment on lines +79 to +96
if votes >= required {
drop(entry);
if let Some((_, consensus)) = self.pending.remove(archive_hash) {
info!(archive_hash, votes, required, "Consensus reached");
return ConsensusStatus::Reached {
archive_data: consensus.archive_data,
concurrent_tasks: consensus.concurrent_tasks,
votes,
required,
};
}
}

ConsensusStatus::Pending {
votes,
required,
total_validators,
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n 'dashmap' Cargo.toml

Repository: PlatformNetwork/term-executor

Length of output: 89


🏁 Script executed:

cat -n src/consensus.rs | sed -n '70,110p'

Repository: PlatformNetwork/term-executor

Length of output: 1525


🏁 Script executed:

fd -t f consensus.rs

Repository: PlatformNetwork/term-executor

Length of output: 89


🏁 Script executed:

cat -n src/consensus.rs | sed -n '40,100p'

Repository: PlatformNetwork/term-executor

Length of output: 2255


🏁 Script executed:

rg -n 'fn record_vote' src/consensus.rs -A 30

Repository: PlatformNetwork/term-executor

Length of output: 1113


🏁 Script executed:

rg -n '#\[test\]|#\[tokio::test\]' src/consensus.rs | head -20

Repository: PlatformNetwork/term-executor

Length of output: 168


🏁 Script executed:

rg -n 'remove_if' src/

Repository: PlatformNetwork/term-executor

Length of output: 55


🏁 Script executed:

cat -n src/consensus.rs | sed -n '137,220p'

Repository: PlatformNetwork/term-executor

Length of output: 3512


🏁 Script executed:

web_search
dashmap 6 remove_if method availability

Repository: PlatformNetwork/term-executor

Length of output: 169


🌐 Web query:

dashmap 6 remove_if method documentation

💡 Result:

In dashmap v6 (DashMap), remove_if is documented on the DashMap type as: (docs.rs)

pub fn remove_if<Q>(
    &self,
    key: &Q,
    f: impl FnOnce(&K, &V) -> bool,
) -> Option<(K, V)>
where
    K: Borrow<Q>,
    Q: Hash + Eq + ?Sized,

Behavior: Removes the entry for key and returns Some((K, V)) only if the key exists and the predicate f(&K, &V) returns true; otherwise returns None. (docs.rs)

Locking / safety note: The docs warn it may deadlock if called while holding any kind of reference into the map. (docs.rs)



🏁 Script executed:

rg -n 'entry\(.*\)' src/consensus.rs | head -10

Repository: PlatformNetwork/term-executor

Length of output: 120


🏁 Script executed:

rg -n 'use std::sync::Arc' src/consensus.rs

Repository: PlatformNetwork/term-executor

Length of output: 94


🏁 Script executed:

cat -n src/consensus.rs | sed -n '1,30p'

Repository: PlatformNetwork/term-executor

Length of output: 941


🏁 Script executed:

rg -n 'tokio::spawn\|thread::spawn' src/consensus.rs

Repository: PlatformNetwork/term-executor

Length of output: 55


🏁 Script executed:

rg -n 'pub.*fn.*key' src/consensus.rs

Repository: PlatformNetwork/term-executor

Length of output: 55


🏁 Script executed:

rg -n 'RefMut' src/consensus.rs

Repository: PlatformNetwork/term-executor

Length of output: 55


🏁 Script executed:

python3 << 'EOF'
# Check DashMap RefMut API in dashmap 6
# The question is whether RefMut::key() exists

# From the code, entry is created via:
# let mut entry = self.pending.entry(archive_hash.to_string()).or_insert_with(...)
# This returns a RefMut

# The proposed fix calls entry.key() but RefMut likely doesn't have key()
# We can check by looking at how the entry is accessed

print("DashMap RefMut behavior:")
print("- RefMut provides mutable access to the VALUE")
print("- The key is NOT accessible from RefMut after entry() creates it")
print("- Would need to call entry.key() before mut borrow, or store key separately")
EOF

Repository: PlatformNetwork/term-executor

Length of output: 284


🏁 Script executed:

cat -n src/consensus.rs | sed -n '46,97p'

Repository: PlatformNetwork/term-executor

Length of output: 2031


🌐 Web query:

dashmap RefMut key() method

💡 Result:

In DashMap, dashmap::mapref::one::RefMut<'a, K, V>::key() returns an immutable reference to the key for the entry you’re holding:

  • Signature: pub fn key(&self) -> &K  [1]
  • You can call it on a RefMut you get from e.g. get_mut, entry(...).or_insert(...), etc. (it does not require &mut self). [1][2]

Example:

use dashmap::DashMap;

let m: DashMap<String, i32> = DashMap::new();
m.insert("a".to_string(), 10);

if let Some(mut r) = m.get_mut("a") {
    let k: &String = r.key();
    *r.value_mut() += 1;
    println!("{k} -> {}", *r);
}

Notes:

  • The returned &K is valid only as long as the RefMut guard is alive (it’s tied to the same borrow/lock scope). [1]

Sources:
[1] DashMap RefMut docs (methods incl. key) https://docs.rs/dashmap/latest/dashmap/mapref/one/struct.RefMut.html
[2] DashMap Entry docs (ways you may obtain a RefMut, e.g. or_default) https://docs.rs/dashmap/latest/dashmap/mapref/entry/enum.Entry.html


🏁 Script executed:

rg -n 'pub fn key' src/consensus.rs

Repository: PlatformNetwork/term-executor

Length of output: 55


🏁 Script executed:

cat -n src/consensus.rs | sed -n '46,67p' > /tmp/entry_code.txt && python3 << 'EOF'
# Analyze the DashMap entry API usage
with open('/tmp/entry_code.txt') as f:
    content = f.read()
    print("Entry usage analysis:")
    print(content)
    print("\n---\nKey observation:")
    print("entry is RefMut<'_, K, V>")
    print("RefMut provides: &V, &mut V via Deref/DerefMut")
    print("RefMut does NOT have key() method in standard DashMap API")
    print("To access key, need to call before mut borrow or use entry.key()")
    print("Actually, let's check: DashMap::Entry API...")
EOF

Repository: PlatformNetwork/term-executor

Length of output: 1188


Race condition between drop(entry) and self.pending.remove() is real and critical.

After drop(entry) releases the DashMap shard lock at line 80, a concurrent thread executing record_vote on the same archive_hash can race to call remove() after observing votes >= required. Both threads may reach the removal step, but only the first succeeds. The second thread falls through to return ConsensusStatus::Pending with votes >= required, creating a contradictory state where consensus has been reached and the entry removed, yet the caller receives Pending. This can cause unnecessary retries and logic errors in the caller.

The proposed fix's explanation is incomplete. Draining archive_data does not prevent the race condition — the race occurs on the votes count and the remove() operation itself, not on data contents. However, the technical approach of calling entry.key() before drop() is correct (RefMut provides key(&self) -> &K in DashMap 6). A more robust solution requires atomicity: either hold the RefMut through the removal decision (if DashMap's API permits), use remove_if with a closure validating the vote threshold (though the deadlock warning in DashMap's docs must be heeded), or introduce additional synchronization to serialize the consensus-reached path. The learnings for this codebase emphasize proper synchronization for shared state — this code needs strengthening to uphold that standard.
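
For concreteness, here is a sketch of the remove_if pattern described above, written as the tail of record_vote (identifiers follow the quoted code; this is not a drop-in patch, and the caller still needs to decide what to report when another thread wins the removal):

```rust
// Sketch of the tail of record_vote(): `entry` is the RefMut obtained from
// self.pending.entry(..), and `required` is already computed.
let votes = entry.voters.len();
if votes >= required {
    let key = entry.key().clone();
    drop(entry); // release the shard lock before touching the map again

    // Conditional removal: the closure re-checks the *current* entry, so the
    // removal only succeeds if the stored state still meets the threshold.
    // If another thread already removed the entry, this returns None and the
    // caller must decide how to report that (the remaining gap noted above).
    if let Some((_, consensus)) = self
        .pending
        .remove_if(&key, |_, pending| pending.voters.len() >= required)
    {
        return ConsensusStatus::Reached {
            archive_data: consensus.archive_data,
            concurrent_tasks: consensus.concurrent_tasks,
            votes,
            required,
        };
    }
}
```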

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/consensus.rs` around lines 79 - 96, The race comes from dropping the
DashMap RefMut (`entry`) before calling `self.pending.remove(archive_hash)`,
allowing another thread to remove the same key and leaving this thread to return
Pending despite consensus being reached; fix by capturing the map key via
`entry.key().clone()` before dropping, then perform an atomic conditional remove
using DashMap's conditional removal API (e.g., `remove_if`) that validates the
stored vote count/consensus threshold inside the closure (check `votes >=
required`) so removal only succeeds when the map's current state still meets the
consensus condition; update the code paths around `entry`, `self.pending.remove`
and the consensus-return branch (functions/idents: `entry`, `entry.key()`,
`self.pending`, `remove_if`, `ConsensusStatus::Reached`/`Pending`) to use this
pattern.

Comment on lines +189 to +193
```rust
    let archive_hash = {
        let mut hasher = Sha256::new();
        hasher.update(&archive_bytes);
        hex::encode(hasher.finalize())
    };
```

⚠️ Potential issue | 🟠 Major

SHA-256 of up to 500 MB archive blocks the async runtime — use spawn_blocking.

Sha256 hashing up to max_archive_bytes (500 MB) is CPU-bound work running directly on the Tokio worker thread. This will starve other request handlers during the hash computation.

🐛 Proposed fix
```diff
-    let archive_hash = {
-        let mut hasher = Sha256::new();
-        hasher.update(&archive_bytes);
-        hex::encode(hasher.finalize())
-    };
+    let archive_hash = {
+        let bytes = archive_bytes.clone();
+        tokio::task::spawn_blocking(move || {
+            let mut hasher = Sha256::new();
+            hasher.update(&bytes);
+            hex::encode(hasher.finalize())
+        })
+        .await
+        .map_err(|e| (
+            StatusCode::INTERNAL_SERVER_ERROR,
+            Json(serde_json::json!({
+                "error": "hash_failed",
+                "message": format!("Failed to compute archive hash: {}", e)
+            })),
+        ))?
+    };
```

Note: Ideally avoid the clone() by moving archive_bytes into the blocking task and returning it alongside the hash, since it's needed later for record_vote. As per coding guidelines: "Do NOT add synchronous blocking I/O in async functions — use tokio::task::spawn_blocking for CPU-heavy work."
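
A sketch of that clone-free variant: archive_bytes is moved into the blocking task and returned alongside the hex digest, with the same error mapping as the proposed fix above (assumes the handler's error type is (StatusCode, Json<Value>)):

```rust
// Sketch: move archive_bytes into the blocking task and return it with the
// digest, so no clone of a potentially 500 MB payload is needed.
let (archive_hash, archive_bytes) = tokio::task::spawn_blocking(move || {
    let mut hasher = Sha256::new();
    hasher.update(&archive_bytes);
    (hex::encode(hasher.finalize()), archive_bytes)
})
.await
.map_err(|e| {
    (
        StatusCode::INTERNAL_SERVER_ERROR,
        Json(serde_json::json!({
            "error": "hash_failed",
            "message": format!("Failed to compute archive hash: {}", e)
        })),
    )
})?;
```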

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/handlers.rs` around lines 189 - 193, The current synchronous SHA-256
computation using Sha256 on archive_bytes (producing archive_hash) is CPU-bound
and runs on the Tokio worker thread; move that work into a blocking task by
using tokio::task::spawn_blocking to compute the hash so it doesn't block the
async runtime, and avoid clone() by moving archive_bytes into the spawn_blocking
closure and returning both the computed hex hash and the archive_bytes for later
use in record_vote; update the code that awaits spawn_blocking to receive
(archive_hash, archive_bytes) and use those values downstream.

Comment on lines +243 to +261
```rust
        ConsensusStatus::Reached {
            archive_data,
            concurrent_tasks,
            votes,
            required,
        } => {
            let effective_concurrent = concurrent_tasks
                .unwrap_or(state.config.max_concurrent_tasks)
                .min(state.config.max_concurrent_tasks);

            if state.sessions.has_active_batch() {
                return Err((
                    StatusCode::SERVICE_UNAVAILABLE,
                    Json(serde_json::json!({
                        "error": "busy",
                        "message": "A batch is already running. Wait for it to complete."
                    })),
                ));
            }
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

fd -t f "\.rs$" | head -20

Repository: PlatformNetwork/term-executor

Length of output: 253


🏁 Script executed:

git ls-files "*.rs" | grep -E "(consensus|manager)" | head -20

Repository: PlatformNetwork/term-executor

Length of output: 89


🏁 Script executed:

rg "fn record_vote" --type rs -B 2 -A 15

Repository: PlatformNetwork/term-executor

Length of output: 99


🏁 Script executed:

rg "ConsensusManager" --type rs -l | head -10

Repository: PlatformNetwork/term-executor

Length of output: 99


🏁 Script executed:

rg "fn record_vote" -A 20

Repository: PlatformNetwork/term-executor

Length of output: 1104


🏁 Script executed:

cat -n src/consensus.rs | head -100

Repository: PlatformNetwork/term-executor

Length of output: 3369


🏁 Script executed:

cat -n src/handlers.rs | sed -n '230,280p'

Repository: PlatformNetwork/term-executor

Length of output: 2318


🏁 Script executed:

cat -n src/handlers.rs | sed -n '190,245p'

Repository: PlatformNetwork/term-executor

Length of output: 2197


🏁 Script executed:

rg "record_vote" src/handlers.rs -B 5 -A 30

Repository: PlatformNetwork/term-executor

Length of output: 1097


Consensus entry removed before busy check → validators must re-reach consensus if server is busy.

When record_vote() returns ConsensusStatus::Reached, the consensus entry is already removed from the pending map (line 81 in src/consensus.rs). If has_active_batch() returns true on line 253, the handler returns 503, but the consensus votes are already consumed. Validators must re-submit and re-reach consensus.

Suggested fixes:

  1. Check has_active_batch() before calling record_vote() to prevent consensus consumption while the server is busy (sketched below).
  2. Store consensus state separately before removal, so failed requests can re-insert it for retry.
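
A sketch of option 1, reordering the handler so the busy path never consumes consensus state; the surrounding identifiers (auth.hotkey, required, total_validators) are assumptions based on the sequence diagram:

```rust
// Sketch of option 1: do the busy check before record_vote so a 503 response
// never consumes consensus votes.
if state.sessions.has_active_batch() {
    return Err((
        StatusCode::SERVICE_UNAVAILABLE,
        Json(serde_json::json!({
            "error": "busy",
            "message": "A batch is already running. Wait for it to complete."
        })),
    ));
}

// Only now register this validator's vote; a Reached result can safely remove
// the pending entry because the batch slot is known to be free at this point.
let status = state.consensus_manager.record_vote(
    &archive_hash,
    &auth.hotkey,
    archive_bytes,
    required,
    total_validators,
);
```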
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/handlers.rs` around lines 243 - 261, The handler currently calls
record_vote() which can return ConsensusStatus::Reached and remove the pending
consensus entry before checking state.sessions.has_active_batch(), causing
consumed votes if the server is busy; change the flow in the handler to check
has_active_batch() first and return the 503 without touching consensus, or if
you must call record_vote() before the busy-check, capture and persist the
consensus state (the data inside ConsensusStatus::Reached — e.g., archive_data,
concurrent_tasks, votes, required) before record_vote() removes it so you can
re-insert it into the pending map on the busy path; update the code paths around
record_vote(), ConsensusStatus::Reached, and the pending map manipulation in
consensus.rs to ensure validators can re-submit and re-reach consensus when the
handler returns SERVICE_UNAVAILABLE.

Comment on lines +62 to +83
```rust
    async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
        use bittensor_rs::ss58::encode_ss58;

        let client = bittensor_rs::BittensorClient::with_failover()
            .await
            .map_err(|e| anyhow::anyhow!("Failed to connect to subtensor: {}", e))?;

        let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
            .await
            .map_err(|e| anyhow::anyhow!("Failed to sync metagraph: {}", e))?;

        let mut new_hotkeys = HashSet::new();
        for neuron in metagraph.neurons.values() {
            if neuron.validator_permit && neuron.active && neuron.stake.as_tao() >= min_stake_tao {
                new_hotkeys.insert(encode_ss58(&neuron.hotkey));
            }
        }

        let count = new_hotkeys.len();
        *self.hotkeys.write() = new_hotkeys;
        Ok(count)
    }
```

⚠️ Potential issue | 🟠 Major

No timeout on Bittensor RPC calls — a hanging endpoint will block the refresh loop indefinitely.

BittensorClient::with_failover() and sync_metagraph() are async calls with no timeout guard. If the subtensor RPC hangs, this function will never return, and the refresh loop will stall permanently (the 3-retry logic won't help since the first attempt never completes).

Wrap with tokio::time::timeout:

🛡️ Suggested timeout wrapper
     async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
         use bittensor_rs::ss58::encode_ss58;
+        use tokio::time::timeout;
+        use std::time::Duration;
+
+        const RPC_TIMEOUT: Duration = Duration::from_secs(30);

-        let client = bittensor_rs::BittensorClient::with_failover()
+        let client = timeout(RPC_TIMEOUT, bittensor_rs::BittensorClient::with_failover())
             .await
+            .map_err(|_| anyhow::anyhow!("Subtensor connection timed out"))?
             .map_err(|e| anyhow::anyhow!("Failed to connect to subtensor: {}", e))?;

-        let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
+        let metagraph = timeout(RPC_TIMEOUT, bittensor_rs::sync_metagraph(&client, netuid))
             .await
+            .map_err(|_| anyhow::anyhow!("Metagraph sync timed out"))?
             .map_err(|e| anyhow::anyhow!("Failed to sync metagraph: {}", e))?;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/validator_whitelist.rs` around lines 62 - 83: try_refresh calls BittensorClient::with_failover() and bittensor_rs::sync_metagraph() without timeouts. Wrap both awaited calls with tokio::time::timeout (ideally a configurable Duration) so a hanging RPC returns an error instead of blocking the refresh loop, and map the timeout error into the existing anyhow::Error flow (e.g. anyhow::anyhow!("Timed out ...")). Keep the rest of try_refresh (the new_hotkeys population and the hotkeys write) unchanged.
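A config-driven variant of the same guard, sketched under assumptions: the with_rpc_timeout helper and the rpc_timeout_secs field are illustrative and not part of this PR.

use std::time::Duration;
use tokio::time::timeout;

// Run an RPC future against a configurable deadline and fold both the timeout and
// the underlying error into the existing anyhow::Error flow.
async fn with_rpc_timeout<T, E, F>(secs: u64, what: &str, fut: F) -> anyhow::Result<T>
where
    E: std::fmt::Display,
    F: std::future::Future<Output = Result<T, E>>,
{
    match timeout(Duration::from_secs(secs), fut).await {
        Ok(Ok(value)) => Ok(value),
        Ok(Err(e)) => Err(anyhow::anyhow!("{} failed: {}", what, e)),
        Err(_) => Err(anyhow::anyhow!("{} timed out after {}s", what, secs)),
    }
}

// Hypothetical call sites inside try_refresh, assuming a new rpc_timeout_secs config field:
// let client = with_rpc_timeout(cfg.rpc_timeout_secs, "subtensor connect",
//     bittensor_rs::BittensorClient::with_failover()).await?;
// let metagraph = with_rpc_timeout(cfg.rpc_timeout_secs, "metagraph sync",
//     bittensor_rs::sync_metagraph(&client, netuid)).await?;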

Comment on lines +62 to +77
    async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
        use bittensor_rs::ss58::encode_ss58;

        let client = bittensor_rs::BittensorClient::with_failover()
            .await
            .map_err(|e| anyhow::anyhow!("Failed to connect to subtensor: {}", e))?;

        let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
            .await
            .map_err(|e| anyhow::anyhow!("Failed to sync metagraph: {}", e))?;

        let mut new_hotkeys = HashSet::new();
        for neuron in metagraph.neurons.values() {
            if neuron.validator_permit && neuron.active && neuron.stake.as_tao() >= min_stake_tao {
                new_hotkeys.insert(encode_ss58(&neuron.hotkey));
            }

⚠️ Potential issue | 🟠 Major

Pin bittensor-rs dependency to a specific commit instead of unpinned branch = "main".

The Cargo.toml uses branch = "main" without a pinned commit. While Cargo.lock is currently tracked and provides reproducibility, running cargo update could pull breaking changes from bittensor-rs main, affecting APIs like BittensorClient::with_failover(), sync_metagraph(), and neuron fields. Pin to a specific commit (e.g., rev = "eb58916af5a4d7fef74ef00ea0d61519880b101f") to ensure stability.
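In Cargo.toml terms the pin would look roughly like this (sketch only: the repository URL is a placeholder for the one already in Cargo.toml, and the rev should be verified against Cargo.lock before committing):

[dependencies]
# Keep the existing git URL; only replace the unpinned branch with a pinned rev.
bittensor-rs = { git = "<existing repository URL>", rev = "eb58916af5a4d7fef74ef00ea0d61519880b101f" }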

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/validator_whitelist.rs` around lines 62 - 77: the bittensor-rs dependency is tracked via branch = "main", which can pull in breaking API changes affecting BittensorClient::with_failover, sync_metagraph, and the neuron field accesses used in try_refresh (neuron.validator_permit, neuron.active, neuron.stake). Update Cargo.toml to replace the unpinned branch with a specific revision (rev = "<commit-hash>", e.g. eb58916af5a4d7fef74ef00ea0d61519880b101f) so builds are reproducible. Run cargo update -p bittensor-rs if needed and confirm Cargo.lock reflects the pinned rev.
