Skip to content

app_main: fix commissioning failure — SRP record not updated after ad…#44

Closed
GuvHas wants to merge 2 commits intomainfrom
claude/review-esp32-thread-code-EsKdn
Closed

app_main: fix commissioning failure — SRP record not updated after ad…#44
GuvHas wants to merge 2 commits intomainfrom
claude/review-esp32-thread-code-EsKdn

Conversation

@GuvHas
Copy link
Copy Markdown
Owner

@GuvHas GuvHas commented Mar 20, 2026

During commissioning the controller's "Reconnect" step (step 18) needs to resolve the device's NEW operational address via mDNS before it can open a CASE session. The mDNS record is sourced from the OTBR's SRP proxy.

Root cause: trySrpServiceAdd() ran when Thread first attached (via the onThreadStateChanged callback) and built the SRP instance name from the FIRST fabric in the table at that time (either an old committed fabric from a previous session, or nothing if no fabric existed yet). When addNoc subsequently committed a new fabric with a different CompressedFabricId / NodeId, the OTBR SRP record still carried the old instance name. DNS-SD resolution for the new -._matter._tcp name timed out (~35 s) and commissioning failed with "something went wrong".

Fix: add a FabricTable::Delegate (SrpFabricDelegate) that overrides OnFabricCommitted. It fires synchronously on the CHIP/OpenThread task immediately after addNoc persists the new fabric, giving enough time to update the SRP record before the CASE reconnect DNS-SD lookup is issued.

The delegate:

  1. Clears any stale SRP service (otSrpClientClearHostAndServices — avoids the OnSrpClientNotification crash fixed in the previous commit).
  2. Calls trySrpServiceAdd() with the specific newly committed FabricIndex so FindFabricWithIndex() picks the right fabric even when an older fabric still occupies index 1.

trySrpServiceAdd() gains an optional preferIndex parameter for this path; the onThreadStateChanged / startup paths continue to use the first-fabric fallback as before.

The delegate is registered in the startup ScheduleWork after chip::Server is initialised.

…dNoc

During commissioning the controller's "Reconnect" step (step 18) needs to
resolve the device's NEW operational address via mDNS before it can open a
CASE session.  The mDNS record is sourced from the OTBR's SRP proxy.

Root cause: trySrpServiceAdd() ran when Thread first attached (via the
onThreadStateChanged callback) and built the SRP instance name from the
FIRST fabric in the table at that time (either an old committed fabric from
a previous session, or nothing if no fabric existed yet).  When addNoc
subsequently committed a new fabric with a different CompressedFabricId /
NodeId, the OTBR SRP record still carried the old instance name.  DNS-SD
resolution for the new <CFID>-<NodeId>._matter._tcp name timed out (~35 s)
and commissioning failed with "something went wrong".

Fix: add a FabricTable::Delegate (SrpFabricDelegate) that overrides
OnFabricCommitted.  It fires synchronously on the CHIP/OpenThread task
immediately after addNoc persists the new fabric, giving enough time to
update the SRP record before the CASE reconnect DNS-SD lookup is issued.

The delegate:
  1. Clears any stale SRP service (otSrpClientClearHostAndServices — avoids
     the OnSrpClientNotification crash fixed in the previous commit).
  2. Calls trySrpServiceAdd() with the specific newly committed FabricIndex
     so FindFabricWithIndex() picks the right fabric even when an older
     fabric still occupies index 1.

trySrpServiceAdd() gains an optional preferIndex parameter for this path;
the onThreadStateChanged / startup paths continue to use the first-fabric
fallback as before.

The delegate is registered in the startup ScheduleWork after chip::Server
is initialised.

https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22a77f48a0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// Use Clear (not RemoveHostAndServices) — see srpServiceRemove() for why.
if (s_srp.added) {
s_srp.added = false;
otSrpClientClearHostAndServices(ot);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Reapply SRP host config after clearing host/services

In SrpFabricDelegate::OnFabricCommitted, calling otSrpClientClearHostAndServices() drops the SRP host metadata along with services, but the follow-up path only calls trySrpServiceAdd() (which just adds a service). That means recommissioning cases where s_srp.added is already true can lose the hostname/host-address state set at startup and fail to publish the new operational _matter._tcp instance, so DNS-SD/CASE reconnect can still time out.

Useful? React with 👍 / 👎.

otSrpClientClearHostAndServices() resets the SRP client's host state
(hostname pointer → null, auto-host-address mode disabled) in addition
to removing services.  The previous commit called it in
SrpFabricDelegate::OnFabricCommitted but then only called
trySrpServiceAdd(), which goes straight to otSrpClientAddService().
Without a configured host, the SRP client rejects AddService or silently
fails to send an update, so the new <CFID>-<NodeId>._matter._tcp record
never reaches OTBR and CASE reconnect still times out.

Fix:
- Extract setupSrpHost(otInstance*): sets hostname and re-enables
  auto-host-address mode using the buffer in s_srp.
- Move the hostname buffer from a lambda-local static into SrpCtx
  (s_srp.hostname) so setupSrpHost() and the delegate share one buffer.
- Call setupSrpHost() in OnFabricCommitted immediately after Clear and
  before trySrpServiceAdd().
- Startup lambda uses setupSrpHost() too, eliminating the duplicate
  snprintf/SetHostName/EnableAutoHostAddress block.

https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
@GuvHas GuvHas closed this Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants