Conversation
…or whitelisting and 50% consensus

Integrate dynamic validator whitelisting from Bittensor netuid 100 and consensus-based evaluation triggering, replacing the previous single AUTHORIZED_HOTKEY + WORKER_API_KEY authentication system. Authentication now uses a dynamic whitelist of validators fetched every 5 minutes from the Bittensor blockchain via bittensor-rs. Validators must have validator_permit, be active, and have >=10,000 TAO stake. POST /submit requests only trigger evaluations when >=50% of whitelisted validators submit the same archive payload (identified by SHA-256 hash).

New modules:
- src/validator_whitelist.rs: ValidatorWhitelist with parking_lot::RwLock, background refresh loop with 3-retry exponential backoff, connection resilience (keeps cached whitelist on failure), starts empty and rejects requests with 503 until first successful sync
- src/consensus.rs: ConsensusManager using DashMap for lock-free vote tracking, PendingConsensus entries with TTL (default 60s), reaper loop every 30s, max 100 pending entries cap, duplicate vote detection

Modified modules:
- src/auth.rs: Removed AUTHORIZED_HOTKEY import, api_key field from AuthHeaders, X-Api-Key header extraction, InvalidApiKey error variant. verify_request() now takes &ValidatorWhitelist instead of an API key string. Updated all tests accordingly.
- src/config.rs: Removed AUTHORIZED_HOTKEY constant and worker_api_key field. Added bittensor_netuid, min_validator_stake_tao, validator_refresh_secs, consensus_threshold, consensus_ttl_secs with env var support and sensible defaults. Updated banner output.
- src/handlers.rs: Added ValidatorWhitelist and ConsensusManager to AppState. submit_batch now: checks whitelist non-empty (503), validates against whitelist, computes SHA-256 of archive, records consensus vote, returns 202 with pending status or triggers evaluation on consensus. Moved active batch check to consensus-reached branch only.
- src/main.rs: Added module declarations, creates ValidatorWhitelist and ConsensusManager, spawns background refresh and reaper tasks.
- Cargo.toml: Added bittensor-rs git dependency and mandatory [patch.crates-io] for w3f-bls.
- Dockerfile: Added protobuf-compiler, cmake, clang, mold build deps for bittensor-rs substrate dependencies. Copies .cargo config.
- AGENTS.md and src/AGENTS.md: Updated data flow, module map, env vars, authentication docs to reflect new architecture.

BREAKING CHANGE: WORKER_API_KEY env var and X-Api-Key header no longer required. All validators on Bittensor netuid 100 with sufficient stake are auto-whitelisted.
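For illustration only: a minimal sketch of the ">=50% of whitelisted validators" trigger described above. The function name, the rounding up, and the floor of one vote are assumptions made for the example, not taken from the implementation.

```rust
// Hypothetical helper: how many matching submissions satisfy a ">= threshold"
// rule over the current whitelist size. Rounding up and the floor of 1 are
// assumptions for illustration.
fn required_votes(total_validators: usize, threshold: f64) -> usize {
    ((total_validators as f64 * threshold).ceil() as usize).max(1)
}

fn main() {
    assert_eq!(required_votes(10, 0.5), 5); // 5 of 10 is exactly 50%
    assert_eq!(required_votes(7, 0.5), 4);  // 3.5 rounds up to 4
    assert_eq!(required_votes(1, 0.5), 1);  // a lone validator can still trigger
}
```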
📝 Walkthrough

This pull request introduces a dynamic validator whitelist and consensus-based voting mechanism to replace fixed API key authentication. New modules ValidatorWhitelist and ConsensusManager handle validator synchronization and consensus tracking, respectively, integrated with background refresh and cleanup loops and updated configuration parameters.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Handler
    participant AuthFlow
    participant Whitelist
    participant ConsensusManager
    participant Archive
    participant Executor
    Client->>Handler: POST /submit_batch (archive + auth)
    Handler->>AuthFlow: extract_auth_headers(request)
    AuthFlow-->>Handler: AuthHeaders (hotkey, nonce, sig)
    Handler->>Whitelist: Check if whitelist initialized
    alt Whitelist not ready
        Handler-->>Client: 503 Whitelist Not Ready
    else Whitelist ready
        Handler->>AuthFlow: verify_request(auth, nonce_store, whitelist)
        AuthFlow->>Whitelist: is_whitelisted(hotkey)
        Whitelist-->>AuthFlow: true/false
        alt Invalid auth
            AuthFlow-->>Handler: AuthError
            Handler-->>Client: 401 Unauthorized
        else Valid auth
            Handler->>Handler: Compute SHA-256(archive)
            Handler->>ConsensusManager: record_vote(hash, hotkey, archive, required, total)
            alt Vote already cast by hotkey
                ConsensusManager-->>Handler: AlreadyVoted {votes, required}
                Handler-->>Client: 202 Accepted (pending)
            else Votes not yet reached
                ConsensusManager-->>Handler: Pending {votes, required}
                Handler-->>Client: 202 Accepted (pending)
            else Consensus reached
                ConsensusManager-->>Handler: Reached {archive_data, ...}
                Handler->>Archive: Extract archive data
                Archive-->>Handler: Extracted content
                Handler->>Executor: Spawn batch task
                Executor->>Executor: Process batch
                Handler-->>Client: 202 Accepted (consensus_reached)
            end
        end
    end
```
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The review requires careful analysis of new consensus voting semantics (DashMap concurrency, TTL logic, voting idempotency), validator whitelist Bittensor integration (metagraph sync, retry logic), authentication flow refactoring, the submit_batch state machine (multiple consensus outcomes), and orchestration of background tasks. While changes follow consistent patterns across files, the heterogeneity of new logic (consensus, whitelist, auth, handlers, config) and interdependencies between modules demand separate reasoning for each component.

Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
🧹 Nitpick comments (8)
src/main.rs (1)
42-42: Hardcoded `max_pending = 100` — consider making it configurable.

Other consensus/whitelist parameters are already configurable via `Config` and environment variables. The `max_pending` cap for `ConsensusManager` is hardcoded here, making it the odd one out. Consider adding a `consensus_max_pending` field to `Config` for consistency.

♻️ Suggested change (main.rs)

```diff
-    let consensus_manager = consensus::ConsensusManager::new(100);
+    let consensus_manager = consensus::ConsensusManager::new(config.consensus_max_pending);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/main.rs` at line 42, The ConsensusManager is being constructed with a hardcoded max_pending (ConsensusManager::new(100)); make this value configurable by adding a consensus_max_pending field to the existing Config and wiring it through where ConsensusManager is created: add consensus_max_pending to Config, load it from the same env/config source used by other fields, and replace the literal 100 in the call to ConsensusManager::new(...) with config.consensus_max_pending so the cap is consistent with other configurable consensus/whitelist parameters.

src/auth.rs (2)
72-95: Whitelist check before SS58 validation — good security posture, but note the pre-existing TOCTOU in `NonceStore`.

The ordering (whitelist → SS58 → nonce → signature) is correct: it avoids leaking format-validity information to unauthorized callers and fails fast.

However, `NonceStore::check_and_insert` (lines 22-27) has a pre-existing TOCTOU race — `contains_key` + `insert` on `DashMap` is not atomic. Two concurrent requests with the same nonce could both pass the check. Consider using `DashMap::entry` for atomic insert-if-absent in a follow-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/auth.rs` around lines 72 - 95, NonceStore::check_and_insert currently does a non-atomic contains_key + insert on the DashMap which creates a TOCTOU race allowing two concurrent requests with the same nonce to both succeed; change the implementation of NonceStore::check_and_insert to use DashMap::entry (or equivalent atomic “insert-if-absent” API) so the check-and-insert is performed atomically and returns false if the entry already exists, ensuring only the first inserter succeeds.
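For reference, a minimal self-contained sketch of the entry-based insert-if-absent pattern suggested above. The `seen` field name and the `Instant` value type are assumptions for the example, not the real `NonceStore` layout.

```rust
use dashmap::mapref::entry::Entry;
use dashmap::DashMap;
use std::time::Instant;

// Illustrative stand-in for NonceStore: maps nonce -> first-seen time.
struct NonceStore {
    seen: DashMap<String, Instant>,
}

impl NonceStore {
    /// Atomic insert-if-absent: only the first caller presenting a nonce gets `true`.
    fn check_and_insert(&self, nonce: &str) -> bool {
        match self.seen.entry(nonce.to_string()) {
            Entry::Occupied(_) => false, // replayed nonce
            Entry::Vacant(slot) => {
                slot.insert(Instant::now()); // claimed under the shard lock
                true
            }
        }
    }
}

fn main() {
    let store = NonceStore { seen: DashMap::new() };
    assert!(store.check_and_insert("nonce-1"));
    assert!(!store.check_and_insert("nonce-1")); // second use is rejected
}
```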
254-265: Missing test: `verify_request` with a whitelisted hotkey.

The test covers the rejection path (non-whitelisted hotkey), but there's no test that exercises the full happy path of `verify_request` with a hotkey that IS in the whitelist (which would then proceed to SS58 validation, nonce check, and signature verification). Consider adding a roundtrip test that inserts a key into the whitelist and calls `verify_request`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/auth.rs` around lines 254 - 265, Add a positive unit test that exercises verify_request with a whitelisted hotkey: create a NonceStore and a ValidatorWhitelist, insert a valid SS58 hotkey into the whitelist (using the same format the code expects), create an AuthHeaders with that hotkey, a matching nonce present in NonceStore, and a signature that will pass verification (either by generating a real signature with the project's signing helper or by using the existing test helper/mocking used elsewhere); call verify_request(&auth, &store, &wl) and assert it returns Ok (and any expected AuthResult fields). Ensure the test name reflects the happy path (e.g., test_verify_request_whitelisted) and reuse symbols verify_request, ValidatorWhitelist, NonceStore, and AuthHeaders so the test mirrors the rejection test structure.

src/validator_whitelist.rs (1)
86-120: Tests cover sync methods well; consider adding a test for the `refresh_once` failure path.

The existing unit tests validate the basic data structure operations. Since `try_refresh` depends on external RPC, a full integration test isn't expected here, but you could add a test that directly writes to `hotkeys` and then verifies that a failed `refresh_once` preserves the cached whitelist (the "keeps cached whitelist on failure" behavior documented in AGENTS.md).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/validator_whitelist.rs` around lines 86 - 120, Add a unit test that ensures ValidatorWhitelist keeps its cached hotkeys when refresh_once fails: create a ValidatorWhitelist via ValidatorWhitelist::new(), populate wl.hotkeys.write().insert(...) with a known hotkey, invoke wl.refresh_once(...) in a way that causes it to return an error (simulate RPC failure or call the method path that returns Err), assert the call returns Err, and finally assert wl.validator_count() and wl.is_whitelisted(...) still reflect the original cached entry; reference ValidatorWhitelist::refresh_once, ValidatorWhitelist::new, wl.hotkeys, validator_count, and is_whitelisted to locate the code to test.

Cargo.toml (1)
65-72: Pin git dependencies to a specific `rev` for reproducible builds.

Both `bittensor-rs` and the `w3f-bls` patch reference a branch (main/fix-no-std). While `Cargo.lock` captures the resolved commit, any `cargo update` will silently pull in whatever HEAD is at that moment, risking breakage or supply-chain issues. Pin to an explicit commit:

♻️ Suggested change

```diff
-bittensor-rs = { git = "https://github.com/cortexlm/bittensor-rs", branch = "main" }
+bittensor-rs = { git = "https://github.com/cortexlm/bittensor-rs", rev = "<pin-commit-sha>" }

 [patch.crates-io]
-w3f-bls = { git = "https://github.com/opentensor/bls", branch = "fix-no-std" }
+w3f-bls = { git = "https://github.com/opentensor/bls", rev = "<pin-commit-sha>" }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Cargo.toml` around lines 65 - 72, The git dependencies are pinned to branches which risks non-reproducible builds; update the Cargo.toml entries for the bittensor-rs dependency and the w3f-bls patch to use explicit commit SHAs by replacing the branch = "..." keys with rev = "<commit-sha>" for the specific commits you want to lock to (keep the git URLs and any other keys intact), ensuring both the bittensor-rs and w3f-bls entries reference concrete commit SHAs rather than branches.

src/consensus.rs (3)
104-106: `is_at_capacity` is checked before `record_vote` in the handler, but there's no atomic guard.

If `is_at_capacity()` returns `false` and multiple threads pass this check simultaneously, they could all insert, exceeding `max_pending`. This is a soft limit and the overshoot is bounded by the number of concurrent requests, so it's acceptable as a best-effort guard. Just noting it isn't a strict cap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/consensus.rs` around lines 104 - 106, is_at_capacity currently only reads self.pending.len() so concurrent handlers can all observe false and then call record_vote, causing bounded overshoot; make the capacity check and insertion atomic by moving the check into record_vote (or protecting both with the same lock) so you verify pending.len() < max_pending while holding the mutex that guards pending (or use the map's entry API to conditionally insert only if vacant and size still < max_pending). Update record_vote to perform the length check under the pending lock (or use pending.entry/vacant check) and return a clear capacity error if the insertion would exceed max_pending; keep is_at_capacity as a best-effort helper if needed but do not rely on it for correctness.
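If a strict cap is ever wanted, one possible shape is sketched below. The counter-based reservation and all field and method names are assumptions for illustration, not code from this repository.

```rust
use dashmap::mapref::entry::Entry;
use dashmap::DashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

// Toy model: `count` must also be decremented wherever entries are removed
// (consensus reached or reaped), which is omitted here.
struct CappedPending {
    pending: DashMap<String, Vec<String>>, // archive hash -> voter hotkeys
    count: AtomicUsize,
    max_pending: usize,
}

impl CappedPending {
    /// Returns false only when admitting a brand-new hash would exceed the cap.
    fn admit(&self, hash: &str) -> bool {
        match self.pending.entry(hash.to_string()) {
            Entry::Occupied(_) => true, // existing entry: no new slot needed
            Entry::Vacant(slot) => {
                // Reserve a slot atomically before inserting the new entry.
                let reserved = self
                    .count
                    .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| {
                        (n < self.max_pending).then_some(n + 1)
                    })
                    .is_ok();
                if reserved {
                    slot.insert(Vec::new());
                }
                reserved
            }
        }
    }
}

fn main() {
    let c = CappedPending {
        pending: DashMap::new(),
        count: AtomicUsize::new(0),
        max_pending: 1,
    };
    assert!(c.admit("hash-a"));
    assert!(c.admit("hash-a"));  // same hash is fine
    assert!(!c.admit("hash-b")); // strict cap holds even under races
}
```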
133-231: Tests cover the core scenarios well, but consider adding a concurrent voting test.

The existing tests validate single vote, threshold, duplicates, independent hashes, TTL, and capacity. Given the race condition identified above, an additional multi-threaded test using `std::thread::scope` or `tokio::spawn` with multiple voters racing on the same hash would help validate the fix.

Also, per coding guidelines, async tests should use `#[tokio::test]` — the `reaper_loop` is async but not tested here. A test that inserts an expired entry, advances time (e.g., via `tokio::time::pause()`), and verifies reaping would strengthen coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/consensus.rs` around lines 133 - 231, Add a concurrent voting test that races multiple callers of ConsensusManager::record_vote on the same hash to ensure no double-counting (use std::thread::scope or tokio::spawn with multiple voters and assert final ConsensusStatus or pending_count), and add an async #[tokio::test] that exercises reaper_loop by inserting an expired entry into ConsensusManager::pending, calling tokio::time::pause()/advance to simulate time passage, running reaper_loop (or invoking the reap logic) and asserting the expired entry is removed; reference ConsensusManager, record_vote, reaper_loop, pending, pending_count, and is_at_capacity when implementing these tests.
7-12: `PendingConsensus` stores a full copy of the archive per hash entry.

This is fine for correctness since only the first voter's data is kept (subsequent voters' `archive_data` is dropped by `or_insert_with`). However, note that every pending consensus entry holds the full archive bytes in memory. With `max_pending = 100` and potentially large archives, this could consume significant memory. Consider whether storing only the hash and retrieving the archive from the voter that triggers consensus would be more memory-efficient — though that would be a larger design change and is acceptable for now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/consensus.rs` around lines 7 - 12, PendingConsensus currently stores a full Vec<u8> per entry (archive_data), which can blow up memory; replace archive_data with a shared pointer (e.g., Arc<Vec<u8>>) and ensure wherever a PendingConsensus is created (the or_insert_with site and any constructors) you wrap the archive bytes in Arc::new(...) and clone the Arc when inserting so multiple entries share the same buffer instead of copying it; alternatively, if you prefer the larger design change, store only the archive hash in PendingConsensus and fetch the full archive from the voter that triggers consensus, but for a minimal change use Arc<Vec<u8>> and update all uses of archive_data accordingly.
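A minimal sketch of the Arc-based variant suggested above; the fields mirror the review's description of the struct, but the exact layout is an assumption.

```rust
use std::collections::HashSet;
use std::sync::Arc;
use std::time::Instant;

// Illustrative shape only, not the actual PendingConsensus definition.
struct PendingConsensus {
    archive_data: Arc<Vec<u8>>, // shared buffer, cheap to clone
    voters: HashSet<String>,
    created_at: Instant,
    concurrent_tasks: Option<usize>,
}

impl PendingConsensus {
    fn new(archive: Vec<u8>, concurrent_tasks: Option<usize>) -> Self {
        Self {
            archive_data: Arc::new(archive), // one allocation for the lifetime of the entry
            voters: HashSet::new(),
            created_at: Instant::now(),
            concurrent_tasks,
        }
    }
}

fn main() {
    let p = PendingConsensus::new(vec![0u8; 1024], None);
    let shared = Arc::clone(&p.archive_data); // no byte copy
    assert_eq!(shared.len(), 1024);
    assert!(p.voters.is_empty());
    assert!(p.concurrent_tasks.is_none());
    assert!(p.created_at.elapsed().as_secs() < 1);
}
```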
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AGENTS.md`:
- Around line 37-47: The fenced code block describing ValidatorWhitelist refresh
loop and ConsensusManager reaper loop lacks a language specifier (MD040); update
the block to include a plain-text language tag (e.g., use ```text) so the static
analyzer stops flagging it; modify the section containing "ValidatorWhitelist
refresh loop (every 5 minutes):" and "ConsensusManager reaper loop (every 30
seconds):" (references: ValidatorWhitelist, ConsensusManager,
BittensorClient::with_failover, metagraph) to open the fence with ```text and
close as before.
In `@src/AGENTS.md`:
- Around line 54-60: The doc line about duplicate votes is slightly misleading;
update AGENTS.md (consensus.rs section) to state that duplicate votes are not
silently ignored but cause record_vote() to return ConsensusStatus::AlreadyVoted
along with current vote counts, so callers can surface that to clients; mention
ConsensusManager and record_vote() by name and the ConsensusStatus::AlreadyVoted
variant to make the behavior explicit.
- Line 67: Update the submit_batch handler documentation to include the busy
check: modify the flow on line 67 to show auth header extraction → whitelist
empty check (503) → busy check (503) — call to has_active_batch() — →
verify_request (whitelist + nonce + signature) → multipart upload → SHA-256 hash
→ consensus vote → if pending: return 202 with vote count → if reached: archive
extraction → batch creation → executor spawn; reference the submit_batch handler
and the has_active_batch() check so the doc matches runtime behavior.
In `@src/config.rs`:
- Around line 13-17: The CONSENSUS_THRESHOLD value (parsed as f64) has no bounds
check; update the from_env logic that reads CONSENSUS_THRESHOLD so it validates
the parsed value is within a safe range (e.g., > 0.0 and <= 1.0) and handle
out-of-range values by either clamping to DEFAULT_CONSENSUS_THRESHOLD or
returning a clear error/result and logging a warning; reference
DEFAULT_CONSENSUS_THRESHOLD and the CONSENSUS_THRESHOLD env parse in from_env to
locate and change the code.
In `@src/consensus.rs`:
- Around line 79-96: The race comes from dropping the DashMap RefMut (`entry`)
before calling `self.pending.remove(archive_hash)`, allowing another thread to
remove the same key and leaving this thread to return Pending despite consensus
being reached; fix by capturing the map key via `entry.key().clone()` before
dropping, then perform an atomic conditional remove using DashMap's conditional
removal API (e.g., `remove_if`) that validates the stored vote count/consensus
threshold inside the closure (check `votes >= required`) so removal only
succeeds when the map's current state still meets the consensus condition;
update the code paths around `entry`, `self.pending.remove` and the
consensus-return branch (functions/idents: `entry`, `entry.key()`,
`self.pending`, `remove_if`, `ConsensusStatus::Reached`/`Pending`) to use this
pattern.
In `@src/handlers.rs`:
- Around line 189-193: The current synchronous SHA-256 computation using Sha256
on archive_bytes (producing archive_hash) is CPU-bound and runs on the Tokio
worker thread; move that work into a blocking task by using
tokio::task::spawn_blocking to compute the hash so it doesn't block the async
runtime, and avoid clone() by moving archive_bytes into the spawn_blocking
closure and returning both the computed hex hash and the archive_bytes for later
use in record_vote; update the code that awaits spawn_blocking to receive
(archive_hash, archive_bytes) and use those values downstream.
- Around line 243-261: The handler currently calls record_vote() which can
return ConsensusStatus::Reached and remove the pending consensus entry before
checking state.sessions.has_active_batch(), causing consumed votes if the server
is busy; change the flow in the handler to check has_active_batch() first and
return the 503 without touching consensus, or if you must call record_vote()
before the busy-check, capture and persist the consensus state (the data inside
ConsensusStatus::Reached — e.g., archive_data, concurrent_tasks, votes,
required) before record_vote() removes it so you can re-insert it into the
pending map on the busy path; update the code paths around record_vote(),
ConsensusStatus::Reached, and the pending map manipulation in consensus.rs to
ensure validators can re-submit and re-reach consensus when the handler returns
SERVICE_UNAVAILABLE.
In `@src/validator_whitelist.rs`:
- Around line 62-83: The try_refresh function calls
BittensorClient::with_failover() and bittensor_rs::sync_metagraph() without
timeouts; wrap these async calls with tokio::time::timeout (e.g., a configurable
Duration) so a hanging RPC will return an error instead of blocking the refresh
loop, and map timeout errors into the existing anyhow::Error flow (use
anyhow::anyhow!("Timed out ...") or similar). Specifically, apply timeout to the
with_failover() await and to the sync_metagraph(&client, netuid).await calls,
handle the Result from timeout (map Err(elapsed) -> anyhow error), and keep the
rest of try_refresh (new_hotkeys population and hotkeys write) unchanged.
- Around line 62-77: The dependency for bittensor-rs is currently tracked by
branch="main" which can introduce breaking API changes affecting
BittensorClient::with_failover, sync_metagraph, and neuron field access; update
Cargo.toml to replace the unpinned branch with a specific commit revision (rev =
"<commit-hash>", e.g. eb58916af5a4d7fef74ef00ea0d61519880b101f) so builds are
reproducible and APIs used in try_refresh (BittensorClient::with_failover,
sync_metagraph, neuron.validator_permit, neuron.active, neuron.stake) remain
stable. Ensure you run cargo update -p bittensor-rs if needed and check
Cargo.lock reflects the pinned rev.
---
Nitpick comments:
In `@Cargo.toml`:
- Around line 65-72: The git dependencies are pinned to branches which risks
non-reproducible builds; update the Cargo.toml entries for the bittensor-rs
dependency and the w3f-bls patch to use explicit commit SHAs by replacing the
branch = "..." keys with rev = "<commit-sha>" for the specific commits you want
to lock to (keep the git URLs and any other keys intact), ensuring both the
bittensor-rs and w3f-bls entries reference concrete commit SHAs rather than
branches.
In `@src/auth.rs`:
- Around line 72-95: NonceStore::check_and_insert currently does a non-atomic
contains_key + insert on the DashMap which creates a TOCTOU race allowing two
concurrent requests with the same nonce to both succeed; change the
implementation of NonceStore::check_and_insert to use DashMap::entry (or
equivalent atomic “insert-if-absent” API) so the check-and-insert is performed
atomically and returns false if the entry already exists, ensuring only the
first inserter succeeds.
- Around line 254-265: Add a positive unit test that exercises verify_request
with a whitelisted hotkey: create a NonceStore and a ValidatorWhitelist, insert
a valid SS58 hotkey into the whitelist (using the same format the code expects),
create an AuthHeaders with that hotkey, a matching nonce present in NonceStore,
and a signature that will pass verification (either by generating a real
signature with the project’s signing helper or by using the existing test
helper/mocking used elsewhere); call verify_request(&auth, &store, &wl) and
assert it returns Ok (and any expected AuthResult fields). Ensure the test name
reflects the happy path (e.g., test_verify_request_whitelisted) and reuse
symbols verify_request, ValidatorWhitelist, NonceStore, and AuthHeaders so the
test mirrors the rejection test structure.
In `@src/consensus.rs`:
- Around line 104-106: is_at_capacity currently only reads self.pending.len() so
concurrent handlers can all observe false and then call record_vote, causing
bounded overshoot; make the capacity check and insertion atomic by moving the
check into record_vote (or protecting both with the same lock) so you verify
pending.len() < max_pending while holding the mutex that guards pending (or use
the map's entry API to conditionally insert only if vacant and size still <
max_pending). Update record_vote to perform the length check under the pending
lock (or use pending.entry/vacant check) and return a clear capacity error if
the insertion would exceed max_pending; keep is_at_capacity as a best-effort
helper if needed but do not rely on it for correctness.
- Around line 133-231: Add a concurrent voting test that races multiple callers
of ConsensusManager::record_vote on the same hash to ensure no double-counting
(use std::thread::scope or tokio::spawn with multiple voters and assert final
ConsensusStatus or pending_count), and add an async #[tokio::test] that
exercises reaper_loop by inserting an expired entry into
ConsensusManager::pending, calling tokio::time::pause()/advance to simulate time
passage, running reaper_loop (or invoking the reap logic) and asserting the
expired entry is removed; reference ConsensusManager, record_vote, reaper_loop,
pending, pending_count, and is_at_capacity when implementing these tests.
- Around line 7-12: PendingConsensus currently stores a full Vec<u8> per entry
(archive_data), which can blow up memory; replace archive_data with a shared
pointer (e.g., Arc<Vec<u8>>) and ensure wherever a PendingConsensus is created
(the or_insert_with site and any constructors) you wrap the archive bytes in
Arc::new(...) and clone the Arc when inserting so multiple entries share the
same buffer instead of copying it; alternatively, if you prefer the larger
design change, store only the archive hash in PendingConsensus and fetch the
full archive from the voter that triggers consensus, but for a minimal change
use Arc<Vec<u8>> and update all uses of archive_data accordingly.
In `@src/main.rs`:
- Line 42: The ConsensusManager is being constructed with a hardcoded
max_pending (ConsensusManager::new(100)); make this value configurable by adding
a consensus_max_pending field to the existing Config and wiring it through where
ConsensusManager is created: add consensus_max_pending to Config, load it from
the same env/config source used by other fields, and replace the literal 100 in
the call to ConsensusManager::new(...) with config.consensus_max_pending so the
cap is consistent with other configurable consensus/whitelist parameters.
In `@src/validator_whitelist.rs`:
- Around line 86-120: Add a unit test that ensures ValidatorWhitelist keeps its
cached hotkeys when refresh_once fails: create a ValidatorWhitelist via
ValidatorWhitelist::new(), populate wl.hotkeys.write().insert(...) with a known
hotkey, invoke wl.refresh_once(...) in a way that causes it to return an error
(simulate RPC failure or call the method path that returns Err), assert the call
returns Err, and finally assert wl.validator_count() and wl.is_whitelisted(...)
still reflect the original cached entry; reference
ValidatorWhitelist::refresh_once, ValidatorWhitelist::new, wl.hotkeys,
validator_count, and is_whitelisted to locate the code to test.
```
ValidatorWhitelist refresh loop (every 5 minutes):
1. Connect to Bittensor subtensor via BittensorClient::with_failover()
2. Sync metagraph for netuid 100
3. Filter validators: validator_permit && active && stake >= 10,000 TAO
4. Atomically replace whitelist with new set of SS58 hotkeys
5. On failure: retry up to 3 times with exponential backoff, keep cached whitelist

ConsensusManager reaper loop (every 30 seconds):
1. Remove pending consensus entries older than TTL (default 60s)
```
Add language specifier to fenced code block.
The static analysis tool flags this code block as missing a language identifier (MD040). Since it's a plain-text description rather than code, use `text` or `plaintext`:
♻️ Suggested fix
-```
+```text
ValidatorWhitelist refresh loop (every 5 minutes):

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```
ValidatorWhitelist refresh loop (every 5 minutes):
1. Connect to Bittensor subtensor via BittensorClient::with_failover()
2. Sync metagraph for netuid 100
3. Filter validators: validator_permit && active && stake >= 10,000 TAO
4. Atomically replace whitelist with new set of SS58 hotkeys
5. On failure: retry up to 3 times with exponential backoff, keep cached whitelist

ConsensusManager reaper loop (every 30 seconds):
1. Remove pending consensus entries older than TTL (default 60s)
```
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 37-37: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@AGENTS.md` around lines 37 - 47, The fenced code block describing
ValidatorWhitelist refresh loop and ConsensusManager reaper loop lacks a
language specifier (MD040); update the block to include a plain-text language
tag (e.g., use ```text) so the static analyzer stops flagging it; modify the
section containing "ValidatorWhitelist refresh loop (every 5 minutes):" and
"ConsensusManager reaper loop (every 30 seconds):" (references:
ValidatorWhitelist, ConsensusManager, BittensorClient::with_failover, metagraph)
to open the fence with ```text and close as before.
### `consensus.rs`
- `ConsensusManager` — `DashMap<String, PendingConsensus>` keyed by SHA-256 hex hash of archive bytes.
- `PendingConsensus` — holds archive data, voter set, creation time, concurrent_tasks setting.
- `record_vote()` — adds a validator's vote for an archive hash; returns `ConsensusStatus` (Pending, Reached, AlreadyVoted).
- `is_at_capacity()` — checks if max pending entries reached (prevents memory exhaustion).
- `reaper_loop(ttl_secs)` — background task that removes expired entries every 30 seconds.
- **Convention**: Consensus entries have a 60-second TTL. Max 100 pending entries. Duplicate votes from the same validator are silently acknowledged.
Documentation for consensus.rs accurately reflects the implementation.
One minor note: Line 60 states "Duplicate votes from the same validator are silently acknowledged" — this is slightly imprecise. The code returns ConsensusStatus::AlreadyVoted with vote counts, which the handler presumably surfaces to the caller rather than silently ignoring. Consider rewording to "Duplicate votes return an AlreadyVoted status with current vote counts."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/AGENTS.md` around lines 54 - 60, The doc line about duplicate votes is
slightly misleading; update AGENTS.md (consensus.rs section) to state that
duplicate votes are not silently ignored but cause record_vote() to return
ConsensusStatus::AlreadyVoted along with current vote counts, so callers can
surface that to clients; mention ConsensusManager and record_vote() by name and
the ConsensusStatus::AlreadyVoted variant to make the behavior explicit.
- Route handlers: `health`, `status`, `metrics`, `submit_batch`, `get_batch`, `get_batch_tasks`, `get_task`, `list_batches`.
- Routes: `GET /health`, `GET /status`, `GET /metrics`, `POST /submit`, `GET /batch/{id}`, `GET /batch/{id}/tasks`, `GET /batch/{id}/task/{task_id}`, `GET /batches`, `GET /ws`.
- `submit_batch` handler does: auth header extraction → `verify_request` (hotkey + API key + nonce + signature) → busy check → multipart upload → archive extraction → batch creation → executor spawn.
- `submit_batch` handler does: auth header extraction → whitelist empty check (503) → `verify_request` (whitelist + nonce + signature) → multipart upload → SHA-256 hash → consensus vote → if pending: return 202 with vote count → if reached: archive extraction → batch creation → executor spawn.
🧩 Analysis chain
🏁 Script executed:
rg -n 'has_active_batch' src/handlers.rs
Repository: PlatformNetwork/term-executor
Length of output: 222
🏁 Script executed:
sed -n '60,85p' src/handlers.rs
Repository: PlatformNetwork/term-executor
Length of output: 1176
🏁 Script executed:
sed -n '250,260p' src/handlers.rs
Repository: PlatformNetwork/term-executor
Length of output: 546
Add the busy check to the documented submit_batch flow on line 67.
The has_active_batch() check is still enforced in the handler (returning 503 if a batch is running). The documented flow should include this step: auth header extraction → whitelist empty check (503) → busy check (503) → verify_request (whitelist + nonce + signature) → multipart upload → ... The busy check occurs before multipart processing and prevents concurrent batch submissions.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/AGENTS.md` at line 67, Update the submit_batch handler documentation to
include the busy check: modify the flow on line 67 to show auth header
extraction → whitelist empty check (503) → busy check (503) — call to
has_active_batch() — → verify_request (whitelist + nonce + signature) →
multipart upload → SHA-256 hash → consensus vote → if pending: return 202 with
vote count → if reached: archive extraction → batch creation → executor spawn;
reference the submit_batch handler and the has_active_batch() check so the doc
matches runtime behavior.
```rust
const DEFAULT_BITTENSOR_NETUID: u16 = 100;
const DEFAULT_MIN_VALIDATOR_STAKE_TAO: f64 = 10_000.0;
const DEFAULT_VALIDATOR_REFRESH_SECS: u64 = 300;
const DEFAULT_CONSENSUS_THRESHOLD: f64 = 0.5;
const DEFAULT_CONSENSUS_TTL_SECS: u64 = 60;
```
No bounds validation on consensus_threshold.
CONSENSUS_THRESHOLD is parsed as f64 with no range check. A value of 0.0 makes every single vote trigger consensus immediately (security bypass), and a value > 1.0 makes consensus permanently unreachable. Consider clamping or validating in from_env:
🛡️ Suggested validation
```diff
+    consensus_threshold: {
+        let t = env_parse("CONSENSUS_THRESHOLD", DEFAULT_CONSENSUS_THRESHOLD);
+        t.clamp(0.01, 1.0)
+    },
-    consensus_threshold: env_parse("CONSENSUS_THRESHOLD", DEFAULT_CONSENSUS_THRESHOLD),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/config.rs` around lines 13 - 17, The CONSENSUS_THRESHOLD value (parsed as
f64) has no bounds check; update the from_env logic that reads
CONSENSUS_THRESHOLD so it validates the parsed value is within a safe range
(e.g., > 0.0 and <= 1.0) and handle out-of-range values by either clamping to
DEFAULT_CONSENSUS_THRESHOLD or returning a clear error/result and logging a
warning; reference DEFAULT_CONSENSUS_THRESHOLD and the CONSENSUS_THRESHOLD env
parse in from_env to locate and change the code.
```rust
        if votes >= required {
            drop(entry);
            if let Some((_, consensus)) = self.pending.remove(archive_hash) {
                info!(archive_hash, votes, required, "Consensus reached");
                return ConsensusStatus::Reached {
                    archive_data: consensus.archive_data,
                    concurrent_tasks: consensus.concurrent_tasks,
                    votes,
                    required,
                };
            }
        }

        ConsensusStatus::Pending {
            votes,
            required,
            total_validators,
        }
```
🧩 Analysis chain
🏁 Script executed:
rg -n 'dashmap' Cargo.toml
Repository: PlatformNetwork/term-executor
Length of output: 89
🏁 Script executed:
cat -n src/consensus.rs | sed -n '70,110p'
Repository: PlatformNetwork/term-executor
Length of output: 1525
🏁 Script executed:
fd -t f consensus.rs
Repository: PlatformNetwork/term-executor
Length of output: 89
🏁 Script executed:
cat -n src/consensus.rs | sed -n '40,100p'
Repository: PlatformNetwork/term-executor
Length of output: 2255
🏁 Script executed:
rg -n 'fn record_vote' src/consensus.rs -A 30
Repository: PlatformNetwork/term-executor
Length of output: 1113
🏁 Script executed:
rg -n '#\[test\]|#\[tokio::test\]' src/consensus.rs | head -20
Repository: PlatformNetwork/term-executor
Length of output: 168
🏁 Script executed:
rg -n 'remove_if' src/
Repository: PlatformNetwork/term-executor
Length of output: 55
🏁 Script executed:
cat -n src/consensus.rs | sed -n '137,220p'
Repository: PlatformNetwork/term-executor
Length of output: 3512
🏁 Script executed:
web_search
dashmap 6 remove_if method availability
Repository: PlatformNetwork/term-executor
Length of output: 169
🌐 Web query:
dashmap 6 remove_if method documentation
💡 Result:
In dashmap v6 (DashMap), remove_if is documented on the DashMap type as: (docs.rs)
```rust
pub fn remove_if<Q>(
    &self,
    key: &Q,
    f: impl FnOnce(&K, &V) -> bool,
) -> Option<(K, V)>
where
    K: Borrow<Q>,
    Q: Hash + Eq + ?Sized,
```

Behavior: Removes the entry for key and returns Some((K, V)) only if the key exists and the predicate f(&K, &V) returns true; otherwise returns None. (docs.rs)
Locking / safety note: The docs warn it may deadlock if called while holding any kind of reference into the map. (docs.rs)
Citations:
- 1: https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html?utm_source=openai
- 2: https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html?utm_source=openai
- 3: https://docs.rs/dashmap/latest/dashmap/struct.DashMap.html?utm_source=openai
🏁 Script executed:
rg -n 'entry\(.*\)' src/consensus.rs | head -10
Repository: PlatformNetwork/term-executor
Length of output: 120
🏁 Script executed:
rg -n 'use std::sync::Arc' src/consensus.rs
Repository: PlatformNetwork/term-executor
Length of output: 94
🏁 Script executed:
cat -n src/consensus.rs | sed -n '1,30p'
Repository: PlatformNetwork/term-executor
Length of output: 941
🏁 Script executed:
rg -n 'tokio::spawn\|thread::spawn' src/consensus.rs
Repository: PlatformNetwork/term-executor
Length of output: 55
🏁 Script executed:
rg -n 'pub.*fn.*key' src/consensus.rs
Repository: PlatformNetwork/term-executor
Length of output: 55
🏁 Script executed:
rg -n 'RefMut' src/consensus.rs
Repository: PlatformNetwork/term-executor
Length of output: 55
🏁 Script executed:
python3 << 'EOF'
# Check DashMap RefMut API in dashmap 6
# The question is whether RefMut::key() exists
# From the code, entry is created via:
# let mut entry = self.pending.entry(archive_hash.to_string()).or_insert_with(...)
# This returns a RefMut
# The proposed fix calls entry.key() but RefMut likely doesn't have key()
# We can check by looking at how the entry is accessed
print("DashMap RefMut behavior:")
print("- RefMut provides mutable access to the VALUE")
print("- The key is NOT accessible from RefMut after entry() creates it")
print("- Would need to call entry.key() before mut borrow, or store key separately")
EOF
Repository: PlatformNetwork/term-executor
Length of output: 284
🏁 Script executed:
cat -n src/consensus.rs | sed -n '46,97p'
Repository: PlatformNetwork/term-executor
Length of output: 2031
🌐 Web query:
dashmap RefMut key() method
💡 Result:
In DashMap, dashmap::mapref::one::RefMut<'a, K, V>::key() returns an immutable reference to the key for the entry you’re holding:
- Signature: `pub fn key(&self) -> &K` [1]
- You can call it on a `RefMut` you get from e.g. `get_mut`, `entry(...).or_insert(...)`, etc. (it does not require `&mut self`). [1][2]
Example:
```rust
use dashmap::DashMap;

let m: DashMap<String, i32> = DashMap::new();
m.insert("a".to_string(), 10);
if let Some(mut r) = m.get_mut("a") {
    let k: &String = r.key();
    *r.value_mut() += 1;
    println!("{k} -> {}", *r);
}
```

Notes:
- The returned `&K` is valid only as long as the `RefMut` guard is alive (it's tied to the same borrow/lock scope). [1]
Sources:
[1] DashMap RefMut docs (methods incl. key) https://docs.rs/dashmap/latest/dashmap/mapref/one/struct.RefMut.html
[2] DashMap Entry docs (ways you may obtain a RefMut, e.g. or_default) https://docs.rs/dashmap/latest/dashmap/mapref/entry/enum.Entry.html
🏁 Script executed:
rg -n 'pub fn key' src/consensus.rs
Repository: PlatformNetwork/term-executor
Length of output: 55
🏁 Script executed:
cat -n src/consensus.rs | sed -n '46,67p' > /tmp/entry_code.txt && python3 << 'EOF'
# Analyze the DashMap entry API usage
with open('/tmp/entry_code.txt') as f:
content = f.read()
print("Entry usage analysis:")
print(content)
print("\n---\nKey observation:")
print("entry is RefMut<'_, K, V>")
print("RefMut provides: &V, &mut V via Deref/DerefMut")
print("RefMut does NOT have key() method in standard DashMap API")
print("To access key, need to call before mut borrow or use entry.key()")
print("Actually, let's check: DashMap::Entry API...")
EOF
Repository: PlatformNetwork/term-executor
Length of output: 1188
Race condition between drop(entry) and self.pending.remove() is real and critical.
After drop(entry) releases the DashMap shard lock at line 80, a concurrent thread executing record_vote on the same archive_hash can race to call remove() after observing votes >= required. Both threads may reach the removal step, but only the first succeeds. The second thread falls through to return ConsensusStatus::Pending with votes >= required, creating a contradictory state where consensus has been reached and the entry removed, yet the caller receives Pending. This can cause unnecessary retries and logic errors in the caller.
The proposed fix's explanation is incomplete. Draining archive_data does not prevent the race condition — the race occurs on the votes count and the remove() operation itself, not on data contents. However, the technical approach of calling entry.key() before drop() is correct (RefMut provides key(&self) -> &K in DashMap 6). A more robust solution requires atomicity: either hold the RefMut through the removal decision (if DashMap's API permits), use remove_if with a closure validating the vote threshold (though the deadlock warning in DashMap's docs must be heeded), or introduce additional synchronization to serialize the consensus-reached path. The learnings for this codebase emphasize proper synchronization for shared state — this code needs strengthening to uphold that standard.
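For reference, a self-contained sketch of the remove_if pattern under discussion, using a toy stand-in type (the field name is illustrative, not the real PendingConsensus):

```rust
use dashmap::DashMap;
use std::collections::HashSet;

// Toy stand-in for the real PendingConsensus.
struct Pending {
    voters: HashSet<String>,
}

fn main() {
    let pending: DashMap<String, Pending> = DashMap::new();
    let required = 2usize;

    let mut entry = pending
        .entry("archive-hash".to_string())
        .or_insert_with(|| Pending { voters: HashSet::new() });
    entry.voters.insert("validator-a".to_string());
    entry.voters.insert("validator-b".to_string());
    let key = entry.key().clone(); // RefMut::key(): capture before releasing the guard
    drop(entry); // release the shard lock; remove_if would deadlock otherwise

    // Conditional removal: it only succeeds while the stored state still meets
    // the threshold, so at most one racing caller observes the "reached" transition.
    match pending.remove_if(&key, |_, p| p.voters.len() >= required) {
        Some((_, p)) => println!("consensus reached with {} votes", p.voters.len()),
        None => println!("another caller already consumed the entry (or not enough votes)"),
    }
}
```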
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/consensus.rs` around lines 79 - 96, The race comes from dropping the
DashMap RefMut (`entry`) before calling `self.pending.remove(archive_hash)`,
allowing another thread to remove the same key and leaving this thread to return
Pending despite consensus being reached; fix by capturing the map key via
`entry.key().clone()` before dropping, then perform an atomic conditional remove
using DashMap's conditional removal API (e.g., `remove_if`) that validates the
stored vote count/consensus threshold inside the closure (check `votes >=
required`) so removal only succeeds when the map's current state still meets the
consensus condition; update the code paths around `entry`, `self.pending.remove`
and the consensus-return branch (functions/idents: `entry`, `entry.key()`,
`self.pending`, `remove_if`, `ConsensusStatus::Reached`/`Pending`) to use this
pattern.
```rust
    let archive_hash = {
        let mut hasher = Sha256::new();
        hasher.update(&archive_bytes);
        hex::encode(hasher.finalize())
    };
```
SHA-256 of up to 500 MB archive blocks the async runtime — use spawn_blocking.
Sha256 hashing up to max_archive_bytes (500 MB) is CPU-bound work running directly on the Tokio worker thread. This will starve other request handlers during the hash computation.
🐛 Proposed fix
```diff
-    let archive_hash = {
-        let mut hasher = Sha256::new();
-        hasher.update(&archive_bytes);
-        hex::encode(hasher.finalize())
-    };
+    let archive_hash = {
+        let bytes = archive_bytes.clone();
+        tokio::task::spawn_blocking(move || {
+            let mut hasher = Sha256::new();
+            hasher.update(&bytes);
+            hex::encode(hasher.finalize())
+        })
+        .await
+        .map_err(|e| (
+            StatusCode::INTERNAL_SERVER_ERROR,
+            Json(serde_json::json!({
+                "error": "hash_failed",
+                "message": format!("Failed to compute archive hash: {}", e)
+            })),
+        ))?
+    };
```

Note: Ideally avoid the clone() by moving archive_bytes into the blocking task and returning it alongside the hash, since it's needed later for record_vote. As per coding guidelines: "Do NOT add synchronous blocking I/O in async functions — use tokio::task::spawn_blocking for CPU-heavy work."
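A small runnable sketch of the no-clone variant mentioned in the note; the helper name is hypothetical and only demonstrates moving the buffer into the blocking task and handing it back with the hash:

```rust
use sha2::{Digest, Sha256};

// Hypothetical helper: hashes on a blocking thread and returns the bytes
// unchanged so the caller can still pass them to record_vote.
async fn hash_archive(bytes: Vec<u8>) -> Result<(String, Vec<u8>), tokio::task::JoinError> {
    tokio::task::spawn_blocking(move || {
        let mut hasher = Sha256::new();
        hasher.update(&bytes);
        (hex::encode(hasher.finalize()), bytes)
    })
    .await
}

#[tokio::main]
async fn main() {
    let (hash, bytes) = hash_archive(b"example archive".to_vec()).await.unwrap();
    assert_eq!(bytes.len(), 15); // buffer is returned intact, no clone needed
    println!("{hash}");
}
```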
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/handlers.rs` around lines 189 - 193, The current synchronous SHA-256
computation using Sha256 on archive_bytes (producing archive_hash) is CPU-bound
and runs on the Tokio worker thread; move that work into a blocking task by
using tokio::task::spawn_blocking to compute the hash so it doesn't block the
async runtime, and avoid clone() by moving archive_bytes into the spawn_blocking
closure and returning both the computed hex hash and the archive_bytes for later
use in record_vote; update the code that awaits spawn_blocking to receive
(archive_hash, archive_bytes) and use those values downstream.
```rust
        ConsensusStatus::Reached {
            archive_data,
            concurrent_tasks,
            votes,
            required,
        } => {
            let effective_concurrent = concurrent_tasks
                .unwrap_or(state.config.max_concurrent_tasks)
                .min(state.config.max_concurrent_tasks);

            if state.sessions.has_active_batch() {
                return Err((
                    StatusCode::SERVICE_UNAVAILABLE,
                    Json(serde_json::json!({
                        "error": "busy",
                        "message": "A batch is already running. Wait for it to complete."
                    })),
                ));
            }
```
🧩 Analysis chain
🏁 Script executed:
fd -t f "\.rs$" | head -20
Repository: PlatformNetwork/term-executor
Length of output: 253
🏁 Script executed:
git ls-files "*.rs" | grep -E "(consensus|manager)" | head -20
Repository: PlatformNetwork/term-executor
Length of output: 89
🏁 Script executed:
rg "fn record_vote" --type rs -B 2 -A 15Repository: PlatformNetwork/term-executor
Length of output: 99
🏁 Script executed:
rg "ConsensusManager" --type rs -l | head -10Repository: PlatformNetwork/term-executor
Length of output: 99
🏁 Script executed:
rg "fn record_vote" -A 20Repository: PlatformNetwork/term-executor
Length of output: 1104
🏁 Script executed:
cat -n src/consensus.rs | head -100
Repository: PlatformNetwork/term-executor
Length of output: 3369
🏁 Script executed:
cat -n src/handlers.rs | sed -n '230,280p'
Repository: PlatformNetwork/term-executor
Length of output: 2318
🏁 Script executed:
cat -n src/handlers.rs | sed -n '190,245p'
Repository: PlatformNetwork/term-executor
Length of output: 2197
🏁 Script executed:
rg "record_vote" src/handlers.rs -B 5 -A 30Repository: PlatformNetwork/term-executor
Length of output: 1097
Consensus entry removed before busy check → validators must re-reach consensus if server is busy.
When record_vote() returns ConsensusStatus::Reached, the consensus entry is already removed from the pending map (line 81 in src/consensus.rs). If has_active_batch() returns true on line 253, the handler returns 503, but the consensus votes are already consumed. Validators must re-submit and re-reach consensus.
Suggested fixes:
- Check `has_active_batch()` before calling `record_vote()` to prevent consensus consumption while the server is busy.
- Store consensus state separately before removal, so failed requests can re-insert it for retry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/handlers.rs` around lines 243 - 261, The handler currently calls
record_vote() which can return ConsensusStatus::Reached and remove the pending
consensus entry before checking state.sessions.has_active_batch(), causing
consumed votes if the server is busy; change the flow in the handler to check
has_active_batch() first and return the 503 without touching consensus, or if
you must call record_vote() before the busy-check, capture and persist the
consensus state (the data inside ConsensusStatus::Reached — e.g., archive_data,
concurrent_tasks, votes, required) before record_vote() removes it so you can
re-insert it into the pending map on the busy path; update the code paths around
record_vote(), ConsensusStatus::Reached, and the pending map manipulation in
consensus.rs to ensure validators can re-submit and re-reach consensus when the
handler returns SERVICE_UNAVAILABLE.
```rust
    async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
        use bittensor_rs::ss58::encode_ss58;

        let client = bittensor_rs::BittensorClient::with_failover()
            .await
            .map_err(|e| anyhow::anyhow!("Failed to connect to subtensor: {}", e))?;

        let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
            .await
            .map_err(|e| anyhow::anyhow!("Failed to sync metagraph: {}", e))?;

        let mut new_hotkeys = HashSet::new();
        for neuron in metagraph.neurons.values() {
            if neuron.validator_permit && neuron.active && neuron.stake.as_tao() >= min_stake_tao {
                new_hotkeys.insert(encode_ss58(&neuron.hotkey));
            }
        }

        let count = new_hotkeys.len();
        *self.hotkeys.write() = new_hotkeys;
        Ok(count)
    }
```
No timeout on Bittensor RPC calls — a hanging endpoint will block the refresh loop indefinitely.
BittensorClient::with_failover() and sync_metagraph() are async calls with no timeout guard. If the subtensor RPC hangs, this function will never return, and the refresh loop will stall permanently (the 3-retry logic won't help since the first attempt never completes).
Wrap with tokio::time::timeout:
🛡️ Suggested timeout wrapper
```diff
 async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
     use bittensor_rs::ss58::encode_ss58;
+    use tokio::time::timeout;
+    use std::time::Duration;
+
+    const RPC_TIMEOUT: Duration = Duration::from_secs(30);

-    let client = bittensor_rs::BittensorClient::with_failover()
+    let client = timeout(RPC_TIMEOUT, bittensor_rs::BittensorClient::with_failover())
         .await
+        .map_err(|_| anyhow::anyhow!("Subtensor connection timed out"))?
         .map_err(|e| anyhow::anyhow!("Failed to connect to subtensor: {}", e))?;

-    let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
+    let metagraph = timeout(RPC_TIMEOUT, bittensor_rs::sync_metagraph(&client, netuid))
         .await
+        .map_err(|_| anyhow::anyhow!("Metagraph sync timed out"))?
         .map_err(|e| anyhow::anyhow!("Failed to sync metagraph: {}", e))?;
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
-    use bittensor_rs::ss58::encode_ss58;
-
-    let client = bittensor_rs::BittensorClient::with_failover()
-        .await
-        .map_err(|e| anyhow::anyhow!("Failed to connect to subtensor: {}", e))?;
-
-    let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
-        .await
-        .map_err(|e| anyhow::anyhow!("Failed to sync metagraph: {}", e))?;
-
-    let mut new_hotkeys = HashSet::new();
-    for neuron in metagraph.neurons.values() {
-        if neuron.validator_permit && neuron.active && neuron.stake.as_tao() >= min_stake_tao {
-            new_hotkeys.insert(encode_ss58(&neuron.hotkey));
-        }
-    }
-
-    let count = new_hotkeys.len();
-    *self.hotkeys.write() = new_hotkeys;
-    Ok(count)
-}
+async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
+    use bittensor_rs::ss58::encode_ss58;
+    use tokio::time::timeout;
+    use std::time::Duration;
+
+    const RPC_TIMEOUT: Duration = Duration::from_secs(30);
+
+    let client = timeout(RPC_TIMEOUT, bittensor_rs::BittensorClient::with_failover())
+        .await
+        .map_err(|_| anyhow::anyhow!("Subtensor connection timed out"))?
+        .map_err(|e| anyhow::anyhow!("Failed to connect to subtensor: {}", e))?;
+
+    let metagraph = timeout(RPC_TIMEOUT, bittensor_rs::sync_metagraph(&client, netuid))
+        .await
+        .map_err(|_| anyhow::anyhow!("Metagraph sync timed out"))?
+        .map_err(|e| anyhow::anyhow!("Failed to sync metagraph: {}", e))?;
+
+    let mut new_hotkeys = HashSet::new();
+    for neuron in metagraph.neurons.values() {
+        if neuron.validator_permit && neuron.active && neuron.stake.as_tao() >= min_stake_tao {
+            new_hotkeys.insert(encode_ss58(&neuron.hotkey));
+        }
+    }
+
+    let count = new_hotkeys.len();
+    *self.hotkeys.write() = new_hotkeys;
+    Ok(count)
+}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/validator_whitelist.rs` around lines 62 - 83, The try_refresh function
calls BittensorClient::with_failover() and bittensor_rs::sync_metagraph()
without timeouts; wrap these async calls with tokio::time::timeout (e.g., a
configurable Duration) so a hanging RPC will return an error instead of blocking
the refresh loop, and map timeout errors into the existing anyhow::Error flow
(use anyhow::anyhow!("Timed out ...") or similar). Specifically, apply timeout
to the with_failover() await and to the sync_metagraph(&client, netuid).await
calls, handle the Result from timeout (map Err(elapsed) -> anyhow error), and
keep the rest of try_refresh (new_hotkeys population and hotkeys write)
unchanged.
```rust
    async fn try_refresh(&self, netuid: u16, min_stake_tao: f64) -> anyhow::Result<usize> {
        use bittensor_rs::ss58::encode_ss58;

        let client = bittensor_rs::BittensorClient::with_failover()
            .await
            .map_err(|e| anyhow::anyhow!("Failed to connect to subtensor: {}", e))?;

        let metagraph = bittensor_rs::sync_metagraph(&client, netuid)
            .await
            .map_err(|e| anyhow::anyhow!("Failed to sync metagraph: {}", e))?;

        let mut new_hotkeys = HashSet::new();
        for neuron in metagraph.neurons.values() {
            if neuron.validator_permit && neuron.active && neuron.stake.as_tao() >= min_stake_tao {
                new_hotkeys.insert(encode_ss58(&neuron.hotkey));
            }
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check Cargo.toml for bittensor-rs dependency
echo "=== Checking Cargo.toml for bittensor-rs dependency ==="
rg -A 3 'bittensor-rs' Cargo.toml
echo ""
echo "=== Checking Cargo.lock for bittensor-rs revision ==="
rg -A 10 'name = "bittensor-rs"' Cargo.lock
echo ""
echo "=== Checking validator_whitelist.rs for API usage ==="
cat -n src/validator_whitelist.rs | head -100
Repository: PlatformNetwork/term-executor
Length of output: 4507
🏁 Script executed:
#!/bin/bash
# Check if we can inspect the bittensor-rs APIs at the pinned commit
echo "=== Searching for bittensor-rs function/struct definitions online or in cache ==="
# Try to find information about bittensor-rs API
# First, let's check if there are any other Rust files that import from bittensor-rs
rg -n 'use bittensor_rs::' src/ -A 2
echo ""
echo "=== Check if there are any documentation or comments about the APIs ==="
rg -B 2 -A 2 'with_failover|sync_metagraph|validator_permit' src/
Repository: PlatformNetwork/term-executor
Length of output: 2093
🏁 Script executed:
#!/bin/bash
echo "=== Check if Cargo.lock exists and is tracked ==="
git ls-files | grep -E "Cargo\.lock|Cargo\.toml"
echo ""
echo "=== Check git history of Cargo.lock if it exists ==="
git log --oneline Cargo.lock 2>/dev/null | head -5 || echo "Cargo.lock not in git history or doesn't exist"
echo ""
echo "=== Check if there's a .gitignore entry for Cargo.lock ==="
cat .gitignore 2>/dev/null | grep -i "cargo.lock" || echo "No explicit Cargo.lock entry in .gitignore"
Repository: PlatformNetwork/term-executor
Length of output: 412
Pin bittensor-rs dependency to a specific commit instead of unpinned branch = "main".
The Cargo.toml uses branch = "main" without a pinned commit. While Cargo.lock is currently tracked and provides reproducibility, running cargo update could pull breaking changes from bittensor-rs main, affecting APIs like BittensorClient::with_failover(), sync_metagraph(), and neuron fields. Pin to a specific commit (e.g., rev = "eb58916af5a4d7fef74ef00ea0d61519880b101f") to ensure stability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/validator_whitelist.rs` around lines 62 - 77, The dependency for
bittensor-rs is currently tracked by branch="main" which can introduce breaking
API changes affecting BittensorClient::with_failover, sync_metagraph, and neuron
field access; update Cargo.toml to replace the unpinned branch with a specific
commit revision (rev = "<commit-hash>", e.g.
eb58916af5a4d7fef74ef00ea0d61519880b101f) so builds are reproducible and APIs
used in try_refresh (BittensorClient::with_failover, sync_metagraph,
neuron.validator_permit, neuron.active, neuron.stake) remain stable. Ensure you
run cargo update -p bittensor-rs if needed and check Cargo.lock reflects the
pinned rev.
Summary
Replace the static single-hotkey + API-key authentication system with dynamic validator whitelisting from the Bittensor blockchain (netuid 100) and a 50% consensus mechanism for evaluation triggering. Validators with ≥10,000 TAO stake are automatically whitelisted via bittensor-rs, and POST /submit evaluations only proceed when a majority of whitelisted validators agree on the same archive payload.

Changes
New modules
- `src/validator_whitelist.rs` — `ValidatorWhitelist` backed by `parking_lot::RwLock<HashSet<String>>`, with a background refresh loop (every 5 min) that connects to subtensor via `BittensorClient::with_failover()`, syncs the metagraph for netuid 100, and filters validators by `validator_permit && active && stake >= 10k TAO`. Includes retry with exponential backoff (3 attempts) and graceful fallback to cached whitelist on connection failure.
- `src/consensus.rs` — `ConsensusManager` using `DashMap<String, PendingConsensus>` keyed by SHA-256 archive hash. Tracks per-hash voter sets, enforces 50% threshold before triggering evaluation, prevents duplicate votes, caps pending entries at 100, and runs a 30-second TTL reaper for expired entries (default 60s TTL).

Modified modules
- `src/auth.rs` — Removed `AUTHORIZED_HOTKEY` import, `X-Api-Key` header extraction, `InvalidApiKey` error variant, and API key comparison. `verify_request()` now accepts a `&ValidatorWhitelist` reference and checks hotkey membership in the dynamic whitelist. `AuthHeaders` reduced from 4 fields to 3.
- `src/config.rs` — Removed `AUTHORIZED_HOTKEY` constant and `worker_api_key` field (no longer requires `WORKER_API_KEY` env var). Added configurable fields: `bittensor_netuid`, `min_validator_stake_tao`, `validator_refresh_secs`, `consensus_threshold`, `consensus_ttl_secs` with sensible defaults and env var overrides.
- `src/handlers.rs` — `AppState` extended with `ValidatorWhitelist` and `ConsensusManager`. `submit_batch` handler now: rejects requests when whitelist is empty (503), computes SHA-256 of archive, records consensus votes, returns 202 with pending status or triggers evaluation on threshold reached.
- `src/main.rs` — Added module declarations, creates and spawns whitelist refresh loop and consensus reaper as background tasks, wires new state into `AppState`.
- `Cargo.toml` — Added `bittensor-rs` git dependency and required `[patch.crates-io]` for `w3f-bls`.
- `Dockerfile` — Added `protobuf-compiler`, `cmake`, `clang` build dependencies and `.cargo` config copy for `bittensor-rs` compilation.

Documentation
- `AGENTS.md` and `src/AGENTS.md` — Updated data flow, module map, shared state descriptions, environment variables, and authentication documentation.

Breaking Changes
- `X-Api-Key` header is no longer accepted or required
- `WORKER_API_KEY` environment variable is no longer needed
- `AUTHORIZED_HOTKEY` replaced by dynamic whitelist — only Bittensor netuid 100 validators with ≥10k TAO stake are authorized
- `POST /submit` now returns 202 with `pending_consensus` status until ≥50% of validators submit the same archive

Summary by CodeRabbit
Release Notes
New Features
Refactor