Skip to content

[CRE] [2/5] Relay DON node handler for confidential relay#21639

Open
nadahalli wants to merge 12 commits intodevelopfrom
tejaswi/cw-2-relay-handler
Open

[CRE] [2/5] Relay DON node handler for confidential relay#21639
nadahalli wants to merge 12 commits intodevelopfrom
tejaswi/cw-2-relay-handler

Conversation

@nadahalli
Copy link
Copy Markdown
Contributor

@nadahalli nadahalli commented Mar 23, 2026

Context

Part of #21635 (confidential workflow execution). [2/5] in the series.
Can be reviewed and merged independently.

What this does

Adds a GatewayConnectorHandler that runs on relay DON nodes. It
validates Nitro attestation on incoming enclave requests, then proxies
to VaultDON (secrets_get) or capability DONs (capability_exec).

Each relay node calls the target DON through CRE's standard capability
dispatch (which includes DON-level consensus), receives the aggregated
response, and translates it deterministically before returning to the
gateway. The gateway handler (#21638, [1/5]) aggregates these responses
using F+1 quorum.

Supports multi-PCR validation for enclave pools where each instance
has unique PCR values.

  • Config: CRE.ConfidentialRelay TOML block (enabled, trustedPCRs,
    caRootsPEM) for relay DON node configuration.

Dependencies

None. This PR is self-contained.

Copilot AI review requested due to automatic review settings March 23, 2026 17:00
@nadahalli nadahalli requested review from a team as code owners March 23, 2026 17:00
@github-actions
Copy link
Copy Markdown
Contributor

👋 nadahalli, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Copy link
Copy Markdown
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset as well as in the text include at least one of the following tags:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

✅ No conflicts with other open PRs targeting develop

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Risk Rating: HIGH

Adds confidential-relay request handling across the gateway and relay DON nodes as part of confidential workflow execution (#21635), including quorum aggregation on the gateway side and Nitro-attestation validation on the relay node side before proxying to Vault/capability execution.

Changes:

  • Introduces a new gateway handler that fans out confidential relay requests to relay DON nodes and aggregates node responses to an F+1 quorum.
  • Introduces a new relay DON node GatewayConnectorHandler that validates Nitro attestations (including multi-PCR pools) and proxies secrets_get/capability_exec to Vault/capabilities.
  • Updates dependencies to include attestation-related packages (chainlink-common/pkg/teeattestation, hf/nitrite) and bumps fxamacker/cbor.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
go.mod Bumps/adds dependencies needed for TEE attestation + relay protocol types.
go.sum Records new/updated module checksums for the dependency changes.
core/services/gateway/handlers/confidentialrelay/handler.go New gateway-side fanout + quorum aggregation handler with metrics and request lifecycle management.
core/services/gateway/handlers/confidentialrelay/aggregator.go New quorum aggregator for matching node responses (digest-based) with “quorum unobtainable” detection.
core/services/gateway/handlers/confidentialrelay/handler_test.go Unit tests covering quorum success/divergence, timeouts, duplicate IDs, fanout failures, and rate-limiting.
core/capabilities/confidentialrelay/handler.go New relay DON node handler validating Nitro attestation and proxying to Vault/capability execution.
core/capabilities/confidentialrelay/handler_test.go Unit tests for relay handler behavior (attestation, capability exec, unsupported methods, invalid params).
core/capabilities/confidentialrelay/service.go Lifecycle wrapper that defers handler creation until the gateway connector is available.

Comment on lines +163 to +167
func (h *Handler) handleSecretsGet(ctx context.Context, gatewayID string, req *jsonrpc.Request[json.RawMessage]) *jsonrpc.Response[json.RawMessage] {
var params confidentialrelaytypes.SecretsRequestParams
if err := json.Unmarshal(*req.Params, &params); err != nil {
return h.errorResponse(ctx, gatewayID, req, jsonrpc.ErrInvalidParams, err)
}
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleSecretsGet dereferences *req.Params without checking for req.Params == nil. A JSON-RPC request can omit params, which would panic here. Add a nil check and return a JSON-RPC invalid request/invalid params error response instead of crashing the handler.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

func (h *Handler) handleCapabilityExecute(ctx context.Context, gatewayID string, req *jsonrpc.Request[json.RawMessage]) *jsonrpc.Response[json.RawMessage] {
var params confidentialrelaytypes.CapabilityRequestParams
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleCapabilityExecute dereferences *req.Params without checking for req.Params == nil, which can panic for JSON-RPC requests that omit params. Add a nil check and return an appropriate JSON-RPC invalid request/invalid params response.

Suggested change
var params confidentialrelaytypes.CapabilityRequestParams
var params confidentialrelaytypes.CapabilityRequestParams
if req.Params == nil {
return h.errorResponse(ctx, gatewayID, req, jsonrpc.ErrInvalidParams, fmt.Errorf("missing params"))
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +180 to +182
// Normalize owner to EIP-55 checksum format, matching how secrets are stored.
normalizedOwner := common.HexToAddress(params.Owner).Hex()

Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

common.HexToAddress(params.Owner) will silently coerce invalid/empty owner strings to the zero address. If params.Owner is malformed, this could query Vault under an unintended owner without surfacing an error. Consider validating params.Owner first (e.g., common.IsHexAddress) and returning an ErrInvalidParams response when it’s not a valid EVM address.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +157 to +159
h.metrics.requestSuccess.Add(ctx, 1, metric.WithAttributes(
attribute.String("gateway_id", gatewayID),
))
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HandleGatewayMessage increments requestSuccess unconditionally after sending the response, even when the response contains a JSON-RPC error (and errorResponse already increments requestInternalError). This will skew success/error rates in telemetry. Consider incrementing success only when response.Error == nil (and using the error counter otherwise).

Suggested change
h.metrics.requestSuccess.Add(ctx, 1, metric.WithAttributes(
attribute.String("gateway_id", gatewayID),
))
if response != nil && response.Error == nil {
h.metrics.requestSuccess.Add(ctx, 1, metric.WithAttributes(
attribute.String("gateway_id", gatewayID),
))
}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines +1 to +5
package confidentialrelay

import (
"context"
"encoding/json"
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR title/description emphasize the relay DON node handler, but this PR also introduces a new gateway-side handler (core/services/gateway/handlers/confidentialrelay/*) that fans out and aggregates responses. If that’s intentional, consider updating the PR description (or splitting) so reviewers don’t miss the gateway-surface changes.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The gateway handler files were erroneously included in this PR. They belong to PR #21638 [1/5]. Removed.

l.Debugw("aggregating responses, waiting for other nodes...", "error", err)
return nil
case err != nil:
l.Error("quorum unobtainable, returning response to user...", "error", err, "responses", maps.Values(ar.responses))
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maps.Values(ar.responses) reads the underlying map without holding ar.mu. Since ar.responses is mutated under addResponseForNode, this log line can trigger a data race / concurrent map iteration panic under load. Use copiedResponses := ar.copiedResponses() (or lock around the read) and log the copied map/values instead.

Suggested change
l.Error("quorum unobtainable, returning response to user...", "error", err, "responses", maps.Values(ar.responses))
l.Error("quorum unobtainable, returning response to user...", "error", err, "responses", copiedResponses)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was erroneously included and has been removed from this PR. Tracked in #21638.

Comment on lines +241 to +257
func (h *handler) HandleJSONRPCUserMessage(ctx context.Context, req jsonrpc.Request[json.RawMessage], callback gwhandlers.Callback) error {
if req.ID == "" {
return errors.New("request ID cannot be empty")
}
if len(req.ID) > 200 {
return errors.New("request ID is too long: " + strconv.Itoa(len(req.ID)) + ". max is 200 characters")
}

l := logger.With(h.lggr, "method", req.Method, "requestID", req.ID)
l.Debugw("handling confidential relay request")

ar, err := h.newActiveRequest(req, callback)
if err != nil {
return err
}

return h.fanOutToNodes(ctx, l, ar)
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HandleJSONRPCUserMessage forwards any req.Method without validation and doesn’t guard against req.Params == nil. Other gateway handlers validate supported methods and reject missing params (e.g., core/services/gateway/handlers/vault/handler.go:336-396 and core/services/gateway/handlers/capabilities/v2/http_trigger_handler.go:127-131). Consider failing fast here with UnsupportedMethodError/InvalidParamsError to avoid panics later (notably errorResponse dereferences *req.Params).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file was erroneously included and has been removed from this PR. Tracked in #21638.

@nadahalli nadahalli requested a review from a team as a code owner March 23, 2026 17:19
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

CORA - Pending Reviewers

Codeowners Entry Overall Num Files Owners
* 💬 8 @smartcontractkit/foundations, @smartcontractkit/core
/core/capabilities/ 💬 3 @smartcontractkit/keystone, @smartcontractkit/capabilities-team
/core/web/resolver 💬 3 @smartcontractkit/foundations, @smartcontractkit/core
go.md 💬 1 @smartcontractkit/core, @smartcontractkit/foundations
go.mod 💬 6 @smartcontractkit/core, @smartcontractkit/foundations
go.sum 💬 6 @smartcontractkit/core, @smartcontractkit/foundations
integration-tests/go.mod 💬 1 @smartcontractkit/core, @smartcontractkit/devex-tooling, @smartcontractkit/foundations
integration-tests/go.sum 💬 1 @smartcontractkit/core, @smartcontractkit/devex-tooling, @smartcontractkit/foundations

Legend: ✅ Approved | ❌ Changes Requested | 💬 Commented | 🚫 Dismissed | ⏳ Pending | ❓ Unknown

For more details, see the full review summary.

@nadahalli nadahalli force-pushed the tejaswi/cw-2-relay-handler branch 2 times, most recently from ef1dc22 to e9eb648 Compare March 23, 2026 19:56
@trunk-io
Copy link
Copy Markdown

trunk-io bot commented Mar 23, 2026

Static BadgeStatic BadgeStatic BadgeStatic Badge

Failed Test Failure Summary Logs
TestScripts/node/validate The test named 'TestScripts/node/validate' failed during execution. Logs ↗︎
TestScripts/node/validate The test 'TestScripts/node/validate' failed without providing specific error details. Logs ↗︎
TestConfigDocs The test failed because the actual documentation content did not match the expected documentation string. Logs ↗︎
TestConfigDocs The test failed because the actual documentation content did not match the expected documentation string. Logs ↗︎

... and 24 more

View Full Report ↗︎Docs

lggr logger.Logger
metrics *handlerMetrics

// validateAttestation validates Nitro attestation documents.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: nothing nitro specific about this function definition

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced "Nitro" with "TEE" in the type and handler comments.

Comment on lines +33 to +34
trustedPCRs []byte,
caRootsPEM string,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need trustedPCRs as a parameter? They should be retrievable from the capRegistry parameter. The way executor.go does things is:

  • Parse the configs from the registry
  • Extract measurements from the list of enclaves.
  • Any incoming request matching those measurements is accepted.

If this is already being done outside of this logic, why not bring it within this component? I doubt the PCRs are being used anywhere else.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Both TrustedPCRs and CARootsPEM are now read from the capabilities registry per-request via DONsForCapability + ConfigForCapability, same pattern as CC's EnsureFreshEnclaves in executor.go. The handler looks up the confidential-workflows capability config, extracts TrustedValues and CARootsPEM from the EnclavesList, and uses them for attestation validation.

CRE.ConfidentialRelay TOML config is now just Enabled (a kill switch for relay participation).

Comment on lines +409 to +424
// Each enclave instance has different PCR values (WireGuard keys baked
// per CID), so trustedPCRs may be a JSON array of PCR objects. Try each
// set and succeed if any match, same as pool.go's
// validateAttestationAgainstMultipleMeasurements.
var pcrArray []json.RawMessage
if json.Unmarshal(h.trustedPCRs, &pcrArray) == nil && len(pcrArray) > 0 {
var validationErr error
for _, pcrs := range pcrArray {
err := h.validateAttestation(attestationBytes, hash, pcrs, h.caRootsPEM)
if err == nil {
return nil
}
validationErr = errors.Join(validationErr, err)
}
return fmt.Errorf("no trusted PCR set matched: %w", validationErr)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel like this is a bit messy for unmarshalling to be within this logic every time rather than within the constructor.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is addressed. The trustedPCRs parameter is gone. Measurements are now read from the capabilities registry per-request via getEnclaveAttestationConfig, which returns structured data directly. No JSON unmarshalling in the hot path.


// enclaveEntry mirrors the enclave config shape stored in the capabilities
// registry. Only the fields needed for attestation validation are included.
type enclaveEntry struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't match our actual types:

type Enclave struct {
EnclaveID [32]byte json:"enclaveId"
EnclaveURL string json:"enclaveURL"
TrustedValues [][]byte json:"trustedValues" // Array of trusted measurements the client accepts.
EnclaveType EnclaveType json:"enclaveType"
EnclaveAuthHeader string json:"enclaveAuthHeader"
EnclaveExtraData []byte json:"enclaveExtraData"
}

this whole CARootsPem thing is weird. IMO, we should not include this field in our system. I think we'd be able to tell the attestation type from the attestation itself and use the right root certificate.

At any rate, we don't have any reason to not just hardcode it to the Nitro one within chainlink common, for now.

If we do that, we should just get rid of CARootsPem everywhere in chainlink core, and assume the attestationValidatorFunc knows what to do.

Copy link
Copy Markdown
Contributor Author

@nadahalli nadahalli Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. CARootsPEM should not be in the system. I've updated the teeattestation package in chainlink-common to remove it from the production API:

smartcontractkit/chainlink-common@c7b8c137

ValidateAttestation now takes 3 args (always uses the hardcoded AWS Nitro root). ValidateAttestationWithRoots exists for tests with fake enclaves. Once chainlink-common PR #1899 merges, I'll bump common here and drop the caRootsPEM plumbing from the handler.

Re: the type mismatch with CC's Enclave struct, the handler only needs TrustedValues and CARootsPEM (soon just TrustedValues) from the enclave config. Mirroring unused fields (EnclaveID, EnclaveURL, etc.) would be dead code. Just having TrustedValues should be ok, no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed code now.

@vreff
Copy link
Copy Markdown
Contributor

vreff commented Mar 25, 2026

LGTM apart from getting rid of the root ca pem stuff.

vreff
vreff previously approved these changes Mar 26, 2026
vreff
vreff previously approved these changes Mar 30, 2026
skudasov
skudasov previously approved these changes Mar 31, 2026
}, nil
}

type gatewayConnector interface {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit why is this a private interface yet used in the constructor. I get you may want to slim it down but clients won't be able to inspect the constructor param.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

return h.errorResponse(ctx, gatewayID, req, jsonrpc.ErrInternal, fmt.Errorf("failed to get local node: %w", err))
}

capResp, err := vaultCap.Execute(ctx, capabilities.CapabilityRequest{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No metrics on latency or count of executions?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on metrics

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added count and latency metrics for requests.

},
})
if err != nil {
return h.errorResponse(ctx, gatewayID, req, jsonrpc.ErrInternal, fmt.Errorf("vault execute failed: %w", err))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance the err message could leak sensitive information?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You never know. Changed it to "internal error" for safety. Logging the error though. I think we are careful enough not to send sensitive error messages from upstream. So, logging should be ok.

skudasov
skudasov previously approved these changes Apr 1, 2026
nadahalli added 11 commits April 2, 2026 16:09
Add GatewayConnectorHandler that validates Nitro attestation and proxies
enclave requests to VaultDON (secrets_get) and capability DONs
(capability_exec). Supports multi-PCR validation for enclave pools.

Part of #21635
Replace the static trustedPCRs TOML parameter with a per-request
lookup from the capabilities registry. The handler reads the
confidential-workflows capability config to extract enclave
TrustedValues, same pattern as CC's EnsureFreshEnclaves.

Remove TrustedPCRs from CREConfidentialRelay config interface,
ConfidentialRelayConfig TOML struct, and config_cre.go.
CARootsPEM is stored per-enclave in the same cap registry config as
TrustedValues. Read both in getEnclaveAttestationConfig, eliminating
the last TOML field besides Enabled from CRE.ConfidentialRelay.
ValidateAttestation in chainlink-common now always uses the hardcoded
AWS Nitro root cert (c7b8c137). Drop the caRootsPEM parameter from the
handler's attestation validator func, getEnclaveAttestationConfig, and
the enclaveEntry struct.
@nadahalli nadahalli force-pushed the tejaswi/cw-2-relay-handler branch from cb65a31 to 00a16e6 Compare April 2, 2026 14:12
@cl-sonarqube-production
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants