
[CRE] [1/5] Gateway handler for confidential relay#21638

Open
nadahalli wants to merge 25 commits into develop from tejaswi/cw-1-gateway-handler

Conversation

@nadahalli
Contributor

@nadahalli nadahalli commented Mar 23, 2026

Note on duplication with vault handler

This handler shares ~200 lines of structural code with the vault handler (activeRequest tracking, fanOutToNodes, sendResponse, errorResponse, metrics, HandleNodeMessage skeleton). Extracting a shared base type is a natural follow-up but out of scope here to avoid touching the vault handler.

Context

Part of #21635 (confidential workflow execution). [1/5] in the series.
Can be reviewed and merged independently.

What this does

Adds a new gateway handler type confidential-compute-relay that accepts
JSON-RPC requests from enclaves, fans them out to relay DON nodes, and
aggregates responses using F+1 quorum. Supports secrets_get and
capability_exec methods.

F+1 is sufficient because each relay DON node independently calls the
target DON (Vault or capability) through CRE's standard capability
dispatch, which includes DON-level consensus. Each honest relay node
receives the same consensus-aggregated response, then performs
deterministic translation (hex to base64 encoding, JSON marshalling)
before forwarding it back through the gateway. Since honest relay
nodes produce byte-identical responses, F+1 matching guarantees at
least one honest node vouched for the result.
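The quorum step described above can be sketched as digest counting. This is a simplified illustration, not the PR's actual aggregator; `aggregateFPlusOne` and its shape are hypothetical:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// aggregateFPlusOne is a sketch of digest-based F+1 quorum selection:
// honest relay nodes return byte-identical payloads, so we hash each
// response and accept the first digest seen F+1 times.
func aggregateFPlusOne(resps map[string][]byte, f int) ([]byte, error) {
	counts := map[[32]byte]int{}
	for _, body := range resps {
		d := sha256.Sum256(body)
		counts[d]++
		if counts[d] >= f+1 {
			return body, nil
		}
	}
	return nil, fmt.Errorf("fewer than %d identical responses", f+1)
}

func main() {
	resps := map[string][]byte{
		"node1": []byte(`{"ok":true}`),
		"node2": []byte(`{"ok":true}`),
		"node3": []byte(`{"ok":false}`), // a faulty node diverges
	}
	body, err := aggregateFPlusOne(resps, 1) // F=1: need 2 matching responses
	fmt.Println(string(body), err)           // prints {"ok":true} <nil>
}
```

Because only byte-identical responses share a digest, a single faulty node can never outvote the F+1 matching honest ones.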

See #21639 ([2/5]) for the relay DON node handler that processes these
fanned-out requests.

Dependencies

None. This PR is self-contained.

Copilot AI review requested due to automatic review settings March 23, 2026 17:00
@nadahalli nadahalli requested review from a team as code owners March 23, 2026 17:00
@github-actions
Contributor

👋 nadahalli, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions
Contributor

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in the changeset text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

✅ No conflicts with other open PRs targeting develop

Contributor

Copilot AI left a comment

Pull request overview

Risk Rating: MEDIUM

Adds a new Gateway handler type (confidential-compute-relay) to accept JSON-RPC requests from enclaves, fan them out to relay DON nodes, and aggregate node responses via an F+1 quorum.

Changes:

  • Register new handler type/service name in deployment job generation and the gateway handler factory.
  • Implement confidential relay gateway handler + F+1 quorum aggregator.
  • Add unit tests for fan-out, quorum behavior, timeouts, duplicate IDs, and node rate limiting.

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
go.mod Bumps deps needed for the new handler integration.
go.sum Corresponding checksum updates.
deployment/go.mod Aligns deployment module deps with root changes.
deployment/go.sum Corresponding checksum updates for deployment module.
deployment/cre/jobs/pkg/gateway_job.go Adds handler type + default config wiring for gateway job specs.
core/services/gateway/handler_factory.go Registers the new handler type in the factory.
core/services/gateway/handlers/confidentialrelay/handler.go New gateway handler: request tracking, fan-out, rate limiting, timeouts, metrics, callback response.
core/services/gateway/handlers/confidentialrelay/aggregator.go Implements F+1 digest-based quorum selection and “quorum unobtainable” detection.
core/services/gateway/handlers/confidentialrelay/handler_test.go Coverage for quorum success/divergence, timeouts, duplicate IDs, and rate limiting behavior.

Areas warranting careful human review:

  • Timeout behavior across layers (gateway request timeout vs handler requestTimeoutSec) and its impact on activeRequests growth.
  • Quorum aggregation correctness (digest semantics, behavior with errors/invalid digests).
  • Rate-limiting behavior (dropping node responses) and its effect on quorum attainability.
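The "quorum unobtainable" detection called out above boils down to a counting argument: even if every node that has not yet responded matched the most common digest seen so far, could F+1 still be reached? A minimal sketch, with an illustrative function name:

```go
package main

import "fmt"

// quorumUnattainable reports whether an F+1 quorum is already impossible:
// even if all outstanding responses matched the current leading digest,
// the total would still fall short of requiredQuorum.
func quorumUnattainable(maxShaCount, received, donMembers, requiredQuorum int) bool {
	remaining := donMembers - received
	return maxShaCount+remaining < requiredQuorum
}

func main() {
	// 4-node DON, F=1, so requiredQuorum = 2.
	fmt.Println(quorumUnattainable(1, 3, 4, 2)) // false: one node may still match
	fmt.Println(quorumUnattainable(1, 4, 4, 2)) // true: all responded, all distinct
}
```

Dropping node responses at the rate limiter shrinks the effective `remaining`, which is why rate limiting interacts with quorum attainability.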

Reviewer recommendations (per CODEOWNERS):

  • @smartcontractkit/core and @smartcontractkit/foundations (repo-wide owners; go.mod/go.sum also owned by these teams).
  • For deployment changes under /deployment/cre: @smartcontractkit/keystone and @smartcontractkit/operations-platform.

Comment on lines +453 to +463
func newDefaultConfidentialRelayHandler() handler {
	return handler{
		Name:        GatewayHandlerTypeConfidentialRelay,
		ServiceName: "confidential",
		Config: confidentialRelayHandlerConfig{
			NodeRateLimiter: nodeRateLimiterConfig{
				GlobalBurst:    10,
				GlobalRPS:      50,
				PerSenderBurst: 10,
				PerSenderRPS:   10,
			},
Copilot AI Mar 23, 2026

newDefaultConfidentialRelayHandler doesn’t set requestTimeoutSec, so the handler will fall back to its internal default (30s). If the gateway-level request timeout is configured lower (the job enforces a minimum of 5s), the gateway can time out first while the handler keeps the request in activeRequests until its own timeout/cleanup, which can cause unnecessary memory growth under load. Consider adding a RequestTimeoutSec field (similar to vaultHandlerConfig) and setting it to g.RequestTimeoutSec - 1 when constructing the default config so the handler always times out before the gateway.

Also, this handler hardcodes ServiceName: "confidential" instead of using ServiceNameConfidential, which risks future drift if the constant changes.

Contributor Author

Done.

@nadahalli nadahalli requested a review from a team as a code owner March 23, 2026 17:18
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 23, 2026

CORA - Pending Reviewers

Codeowners Entry Overall Num Files Owners
* 💬 4 @smartcontractkit/foundations, @smartcontractkit/core
/deployment/cre 💬 2 @smartcontractkit/keystone, @smartcontractkit/operations-platform
go.mod 💬 6 @smartcontractkit/core, @smartcontractkit/foundations
go.sum 💬 6 @smartcontractkit/core, @smartcontractkit/foundations
integration-tests/go.mod 💬 1 @smartcontractkit/core, @smartcontractkit/devex-tooling, @smartcontractkit/foundations
integration-tests/go.sum 💬 1 @smartcontractkit/core, @smartcontractkit/devex-tooling, @smartcontractkit/foundations

Legend: ✅ Approved | ❌ Changes Requested | 💬 Commented | 🚫 Dismissed | ⏳ Pending | ❓ Unknown

For more details, see the full review summary.

@nadahalli nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from 985dd59 to 52a1f51 Compare March 23, 2026 17:25
@trunk-io

trunk-io bot commented Mar 23, 2026


@nadahalli nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from eb015a2 to f3bd818 Compare March 23, 2026 19:08
github.com/smartcontractkit/chainlink-automation v0.8.1
github.com/smartcontractkit/chainlink-ccip v0.1.1-solana.0.20260317185256-d5f7db87ae70
github.com/smartcontractkit/chainlink-common v0.11.0
github.com/smartcontractkit/chainlink-common v0.11.1-0.20260323163826-2c5b95089478
Contributor

Is this a canonical branch or a temporary branch?

Contributor Author

This is a temporary branch. I have to wait for the new canonical branch to be cut - not sure when. It should be v0.12.0.

Contributor Author

Apparently linking to commit hashes is ok.

Contributor

@vreff vreff Mar 24, 2026

yeah no problem with commit hash, just making sure this hash is on the main branch in chainlink-common.

Contributor Author

Yes, this was merged as a part of chainlink-common#1903

Name:        GatewayHandlerTypeConfidentialRelay,
ServiceName: ServiceNameConfidential,
Config: confidentialRelayHandlerConfig{
	RequestTimeoutSec: requestTimeoutSec - 1,
Contributor

why - 1?

Contributor Author

Copied from the vault handler code. They even have a comment explaining why:

// must be lower than the overall gateway request timeout,
// so we allow for the response to be sent back.

I'll add the same comment here.

Contributor

This seems like a problematic config generally, especially to be copied over along many services. Where did this come from? cc @bolekk

At the very least we could use a global buffer number instead of "1". Not particularly in scope for this PR.

Collaborator

this is very weird mathematics - it is very confusing to get this override !inside! newDefaultConfidentialRelayHandler, which only takes one parameter, requestTimeoutSec. Could you please move this -1 update to the place where the function is called? (Something like https://github.com/smartcontractkit/chainlink/pull/20087/changes did.)

Contributor Author

Done. Moved the -1 to the call site with a comment explaining why. Also added a TODO to unify with the vault handler which does the same subtraction internally.

vreff
vreff previously approved these changes Mar 24, 2026

h.mu.Lock()
defer h.mu.Unlock()
delete(h.activeRequests, userRequest.req.ID)
Contributor

Are we cleaning up active requests that never get serviced? Else there is a minor memory leak.

Contributor Author

Yes. removeExpiredRequests removes them.

Collaborator

sorry, could you please say which line in removeExpiredRequests deletes them?

Contributor Author

@nadahalli nadahalli Mar 25, 2026

Line 226 in core/services/gateway/handlers/confidentialrelay/handler.go has a sendResponse call, which actually deletes them. I had added a comment above that call to explicitly call this out.

Collaborator

ok, so if sendResponse errs at userRequest.SendResponse(resp), what happens to these requests? will they ever be cleaned?

Contributor

+1 this was my concern is that there is no default garbage collection.

Contributor Author

Done now. Renamed the method to sendResponseAndCleanup.

Collaborator

Sorry, I am still not sure how the overall message flow works:

Currently sendResponseAndCleanup just sends the response and, ignoring the send result, deletes it from the active requests. This applies to both success and error responses. Is this acceptable behavior for the application? (For example, http_trigger_handler uses sendWithRetries to ensure that the send operation is retried in case of errors.)

Contributor Author

replied below in the main thread.

remainingResponses := donMembersCount - len(resps)
if maxShaToCount+remainingResponses < requiredQuorum {
	l.Warnw("quorum unattainable for request", "requiredQuorum", requiredQuorum, "remainingResponses", remainingResponses, "maxShaToCount", maxShaToCount)
	return nil, errors.New(errQuorumUnobtainable.Error() + ". RequiredQuorum=" + strconv.Itoa(requiredQuorum) + ". maxShaToCount=" + strconv.Itoa(maxShaToCount) + " remainingResponses=" + strconv.Itoa(remainingResponses))
Collaborator

is there a reason you avoid fmt.Errorf ?

Contributor Author

Fixed in an earlier commit. All errors.New with string concatenation replaced with fmt.Errorf throughout the handler and aggregator.

func (a *aggregator) Aggregate(resps map[string]jsonrpc.Response[json.RawMessage], donF int, donMembersCount int, l logger.Logger) (*jsonrpc.Response[json.RawMessage], error) {
// F+1 is sufficient: each honest node independently validates the enclave's
// Nitro attestation, so F+1 matching responses guarantees at least one
// honest node vouched for the result.
Collaborator

Sorry, this comment does not really explain what is going on. It just says that out of F+1 replies at least 1 is honest - this is understandable, but the surrounding logic (that the honest parties cannot disagree) is not explained.

For example, here https://github.com/smartcontractkit/libocr/blob/a03701e2c02e2331921bfa6887e2257dea4e6084/quorumhelper/quorumhelper.go#L8 we have multiple quorums that can be used. Is there a design doc where this step is explained?

Contributor Author

Good question, and thanks for the libocr quorumhelper link. This is QuorumFPlusOne in libocr terms.

F+1 is correct here because each relay DON node calls the target DON (Vault or capability) through CRE's standard capability dispatch, which includes DON-level consensus. Every honest relay node receives the same consensus-aggregated response, then performs deterministic translation (hex to base64, JSON marshalling). So honest relay nodes produce byte-identical gateway responses. F+1 matching guarantees at least one honest node vouched for the result.

The relay DON node handler is in #21639 ([2/5]) if you want to see the request handling. Updated the comment to explain this.
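The deterministic-translation claim can be illustrated with a small sketch (the payload shape here is invented for illustration): given the same consensus-aggregated input, the hex-to-base64-to-JSON step yields byte-identical output on every honest node.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// translate sketches the deterministic translation step: decode the
// consensus-aggregated hex payload, re-encode as base64, and marshal
// into a fixed JSON shape. Same input, byte-identical output.
func translate(hexPayload string) ([]byte, error) {
	raw, err := hex.DecodeString(hexPayload)
	if err != nil {
		return nil, fmt.Errorf("decode hex: %w", err)
	}
	return json.Marshal(map[string]string{
		"value": base64.StdEncoding.EncodeToString(raw),
	})
}

func main() {
	a, _ := translate("68656c6c6f") // "hello" in hex
	b, _ := translate("68656c6c6f")
	fmt.Println(string(a), string(a) == string(b)) // prints: {"value":"aGVsbG8="} true
}
```

Since the translation has no nondeterminism (no timestamps, no map-order ambiguity in the output), F+1 byte-identical matches imply at least one honest node stood behind the payload.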

) gwhandlers.UserCallbackPayload {
	switch errorCode {
	case api.FatalError:
	case api.NodeReponseEncodingError:
Collaborator

sorry, what is the purpose of this error parsing? Are you saying that the error itself does not contain enough information and need to be augmented?

Contributor Author

This mirrors the vault handler's existing pattern (vault/handler.go has the same switch). The gateway framework uses api.ErrorCode to control what goes back over JSON-RPC vs what stays in logs. For example, NodeReponseEncodingError logs the real error but sends back a generic string to the caller. It's the gateway's convention for sanitizing internal errors on the wire.

Collaborator

Thanks, I see that vault handler also does that. But I don't see other handlers doing that. How come? (or am I missing something)

Contributor Author

The capabilities handler (capabilities/handler.go) inlines its error encoding at each call site rather than extracting a shared method. The vault and relay handlers have more error code variants and wanted consistent logging, so they centralized it. Same underlying logic, just different code organization.

Collaborator

ok, given that you have both sendResponseAndCleanup and errorResponse can send response with errors, I would propose to keep only one function - sendResponseAndCleanup.

Collaborator

I am very sorry, this is still not good. You have the function called sendSuccessResponseAndCleanup which can still send error if response does not encode successfully. Could you please make one function sendResponseAndCleanup that treats all the responses (also with errors) equally?

Contributor Author

Done :-)

Collaborator

Sorry, it was still too complicated - I have made the last commit in this branch to simplify sendResponseAndCleanup. If this is fine with you please ack it here, otherwise feel free to drop it and we can check again how sendResponseAndCleanup can be refactored.

Contributor Author

I am happy to accept the change. Thanks for taking the time to send the commit.

Contributor Author

Who can review the code now? We need someone from core, foundations, keystone, and/or operations-platform.

Comment on lines 351 to 352
if nodeErrors == len(h.donConfig.Members) && nodeErrors > 0 {
return h.sendResponseAndCleanup(ctx, ar, h.constructErrorResponse(ar.req, api.FatalError, errors.New("failed to forward user request to nodes")))
Contributor

Can you explain the error strategy here, why do we only clean up if all nodes have errored? Should we not clean up if F+1 nodes error?

Contributor Author

Yes, waiting for all nodes to error is a bit too much. But F+1 would be too aggressive, since a quorum is still possible after F+1 failures. In our case F+1 is the success quorum, so if that is no longer possible, we error. Changed the code to reflect that.
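The revised failure condition amounts to: error out as soon as fewer than F+1 nodes can still succeed. A sketch, with illustrative names:

```go
package main

import "fmt"

// successQuorumImpossible reports whether F+1 matching successes can no
// longer be reached: once more than donMembers-(F+1) nodes have errored,
// the remaining nodes cannot form a success quorum.
func successQuorumImpossible(nodeErrors, donMembers, f int) bool {
	return donMembers-nodeErrors < f+1
}

func main() {
	// 4-node DON, F=1: up to 2 node errors still leave a possible quorum.
	fmt.Println(successQuorumImpossible(2, 4, 1)) // prints false
	fmt.Println(successQuorumImpossible(3, 4, 1)) // prints true
}
```

This errors earlier than waiting for all N nodes, but never gives up while a success quorum is still reachable.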

Comment on lines +331 to +332
nodeErrors int
nodeErrorsMu sync.Mutex
Contributor

nit use sync/atomic.Uint32

Contributor Author

Done.

nadahalli and others added 24 commits April 2, 2026 16:14
Add a new gateway handler type "confidential-compute-relay" that fans out
enclave JSON-RPC requests to relay DON nodes and aggregates responses
using F+1 quorum. Supports secrets_get and capability_exec methods.

Part of #21635
Add missing api.ConflictError and api.LimitExceededError cases to both
switch statements in handler.go. Run go mod tidy on integration-tests
and integration-tests/load.
- Replace string literals with ServiceNameVault, ServiceNameWorkflows,
  ServiceNameConfidential constants in all handler constructors.
- Add RequestTimeoutSec to confidentialRelayHandlerConfig, set to
  gateway timeout minus 1s (matching vault handler pattern). Ensures
  the handler times out before the gateway, returning a clean error
  instead of the gateway killing the connection.
Replace errors.New(x.Error() + ...) and fmt.Sprintf + errors.New
patterns with fmt.Errorf throughout the relay handler and aggregator.
Use %w for error wrapping where appropriate.
Add comment clarifying sendResponse deletes expired requests.
Explain why F+1 is correct: relay nodes proxy already-aggregated DON
responses through deterministic translation, so honest nodes produce
byte-identical outputs.
Make the buffer visible where handlers are wired up instead of hiding
it inside the constructor. The vault handler does the same subtraction
internally; a follow-up should unify both to use this pattern.
sendResponse no longer has the side effect of deleting from
activeRequests. Callers explicitly call deleteActiveRequest after
sendResponse, making the cleanup visible at every call site.
The old sendResponse skipped the delete if SendResponse failed, leaving
the request in activeRequests forever. Now the delete always runs
regardless of send outcome. The method name makes the cleanup explicit.
@nadahalli nadahalli force-pushed the tejaswi/cw-1-gateway-handler branch from 0ff10e7 to 7d1d54f Compare April 2, 2026 14:18

