app_main: fix commissioning failure — SRP record not updated after ad…#44
app_main: fix commissioning failure — SRP record not updated after ad…#44
Conversation
…dNoc
During commissioning the controller's "Reconnect" step (step 18) needs to
resolve the device's NEW operational address via mDNS before it can open a
CASE session. The mDNS record is sourced from the OTBR's SRP proxy.
Root cause: trySrpServiceAdd() ran when Thread first attached (via the
onThreadStateChanged callback) and built the SRP instance name from the
FIRST fabric in the table at that time (either an old committed fabric from
a previous session, or nothing if no fabric existed yet). When addNoc
subsequently committed a new fabric with a different CompressedFabricId /
NodeId, the OTBR SRP record still carried the old instance name. DNS-SD
resolution for the new <CFID>-<NodeId>._matter._tcp name timed out (~35 s)
and commissioning failed with "something went wrong".
Fix: add a FabricTable::Delegate (SrpFabricDelegate) that overrides
OnFabricCommitted. It fires synchronously on the CHIP/OpenThread task
immediately after addNoc persists the new fabric, giving enough time to
update the SRP record before the CASE reconnect DNS-SD lookup is issued.
The delegate:
1. Clears any stale SRP service (otSrpClientClearHostAndServices — avoids
the OnSrpClientNotification crash fixed in the previous commit).
2. Calls trySrpServiceAdd() with the specific newly committed FabricIndex
so FindFabricWithIndex() picks the right fabric even when an older
fabric still occupies index 1.
trySrpServiceAdd() gains an optional preferIndex parameter for this path;
the onThreadStateChanged / startup paths continue to use the first-fabric
fallback as before.
The delegate is registered in the startup ScheduleWork after chip::Server
is initialised.
https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22a77f48a0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Use Clear (not RemoveHostAndServices) — see srpServiceRemove() for why. | ||
| if (s_srp.added) { | ||
| s_srp.added = false; | ||
| otSrpClientClearHostAndServices(ot); |
There was a problem hiding this comment.
Reapply SRP host config after clearing host/services
In SrpFabricDelegate::OnFabricCommitted, calling otSrpClientClearHostAndServices() drops the SRP host metadata along with services, but the follow-up path only calls trySrpServiceAdd() (which just adds a service). That means recommissioning cases where s_srp.added is already true can lose the hostname/host-address state set at startup and fail to publish the new operational _matter._tcp instance, so DNS-SD/CASE reconnect can still time out.
Useful? React with 👍 / 👎.
otSrpClientClearHostAndServices() resets the SRP client's host state (hostname pointer → null, auto-host-address mode disabled) in addition to removing services. The previous commit called it in SrpFabricDelegate::OnFabricCommitted but then only called trySrpServiceAdd(), which goes straight to otSrpClientAddService(). Without a configured host, the SRP client rejects AddService or silently fails to send an update, so the new <CFID>-<NodeId>._matter._tcp record never reaches OTBR and CASE reconnect still times out. Fix: - Extract setupSrpHost(otInstance*): sets hostname and re-enables auto-host-address mode using the buffer in s_srp. - Move the hostname buffer from a lambda-local static into SrpCtx (s_srp.hostname) so setupSrpHost() and the delegate share one buffer. - Call setupSrpHost() in OnFabricCommitted immediately after Clear and before trySrpServiceAdd(). - Startup lambda uses setupSrpHost() too, eliminating the duplicate snprintf/SetHostName/EnableAutoHostAddress block. https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
During commissioning the controller's "Reconnect" step (step 18) needs to resolve the device's NEW operational address via mDNS before it can open a CASE session. The mDNS record is sourced from the OTBR's SRP proxy.
Root cause: trySrpServiceAdd() ran when Thread first attached (via the onThreadStateChanged callback) and built the SRP instance name from the FIRST fabric in the table at that time (either an old committed fabric from a previous session, or nothing if no fabric existed yet). When addNoc subsequently committed a new fabric with a different CompressedFabricId / NodeId, the OTBR SRP record still carried the old instance name. DNS-SD resolution for the new -._matter._tcp name timed out (~35 s) and commissioning failed with "something went wrong".
Fix: add a FabricTable::Delegate (SrpFabricDelegate) that overrides OnFabricCommitted. It fires synchronously on the CHIP/OpenThread task immediately after addNoc persists the new fabric, giving enough time to update the SRP record before the CASE reconnect DNS-SD lookup is issued.
The delegate:
trySrpServiceAdd() gains an optional preferIndex parameter for this path; the onThreadStateChanged / startup paths continue to use the first-fabric fallback as before.
The delegate is registered in the startup ScheduleWork after chip::Server is initialised.