app_main: add SRP hostname + _matter._tcp service to fix Thread commi…#42
app_main: add SRP hostname + _matter._tcp service to fix Thread commi…#42
Conversation
…ssioning The CHIP SDK's ESP32 DNS-SD layer publishes _matter._tcp via esp-idf mDNS, which fails (error 46) when WiFi is disabled. It never falls back to OpenThread SRP. Without an SRP registration the OTBR cannot proxy the service to mDNS, so the commissioner cannot discover the device for the CASE session and CommissioningComplete is never sent — the FailSafe timer expires and commissioning is rolled back. Two issues kept the SRP client stuck in "Updated" state indefinitely: 1. No hostname — otSrpClientSendUpdate() returns kErrorInvalidState when the host name is null, so the SRP Update is never transmitted. Fixed by calling otSrpClientSetHostName() with the device's 802.15.4 extended address (same convention the CHIP SDK uses on other platforms). 2. No _matter._tcp service — the CHIP SDK does not call otSrpClientAddService() for the operational service on ESP32/WiFi-disabled builds. Fixed by registering an OpenThread state-change callback that calls otSrpClientAddService() with the correct instance name (<CompressedFabricId>-<NodeId>) and MRP TXT records (SII/SAI) whenever Thread attaches and a CHIP fabric is provisioned. The kFabricRemoved event handler resets the s_srp.added flag so the service is re-registered on the next commissioning attempt. https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce2f36bb48
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // fabric table from the correct thread context. | ||
| void onThreadStateChanged(uint32_t flags, void *ctx) | ||
| { | ||
| if (!(flags & OT_CHANGED_THREAD_ROLE)) return; |
There was a problem hiding this comment.
Register SRP service on fabric changes too
onThreadStateChanged() exits unless OT_CHANGED_THREAD_ROLE is set, so trySrpServiceAdd() only runs on role transitions. In re-commissioning flows where the node stays attached to Thread (role unchanged) and the fabric is added/changed later, this path never runs again, leaving _matter._tcp unregistered for the new fabric and breaking operational discovery/CASE completion until a detach/attach happens.
Useful? React with 👍 / 👎.
main/app_main.cpp
Outdated
| // Log prominently so decommissioning is visible without a logic analyser. | ||
| // Reset the SRP added flag so the service is re-registered on the next | ||
| // commissioning attempt (after Thread re-joins with new credentials). | ||
| s_srp.added = false; |
There was a problem hiding this comment.
Remove SRP service from OpenThread when fabric is removed
On kFabricRemoved the code only flips s_srp.added to false but never deregisters the previously added SRP service from the OpenThread client. That leaves the old operational instance potentially advertised with stale <CompressedFabricId>-<NodeId> data, so controllers can discover an invalid record and re-commissioning behavior becomes inconsistent until reboot or another explicit SRP update.
Useful? React with 👍 / 👎.
Addresses two review observations: 1. (Critical) Stale SRP record after fabric removal: On kFabricRemoved the previous code only reset s_srp.added without deregistering the service from OpenThread. The OTBR would continue proxying the old <CFID>-<NodeId>._matter._tcp record to mDNS, making a dead node discoverable and causing confusing CASE failures for any controller that found the stale record. Fix: schedule srpServiceRemove() via ScheduleWork which calls otSrpClientRemoveHostAndServices(aRemoveKeyLease=false, aSendUnregister=true) so the OTBR withdraws the record immediately. 2. (Low risk) Re-commission with Thread already attached: onThreadStateChanged() only fires on OT_CHANGED_THREAD_ROLE, so in the exotic case where Thread stays attached across a fabric change (e.g. normal decommission without dataset reset followed by re-commission), trySrpServiceAdd() would not run. In practice this cannot occur with spec-compliant commissioners because AddOrUpdateThreadNetwork always causes a Thread rejoin (role change). Acknowledged in a comment on onThreadStateChanged(); no code change needed as the Thread role change callback already covers all practical scenarios. https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
…ecommission Add SrpFabricDelegate (chip::FabricTable::Delegate) whose OnFabricAdded() schedules trySrpServiceAdd() whenever AddNOC commits a new fabric. Previously trySrpServiceAdd() was only triggered by OT_CHANGED_THREAD_ROLE via onThreadStateChanged(). If the node remained attached to Thread across a decommission+recommission cycle (fabric removed but Thread dataset kept), the role never changed so the callback never fired, leaving _matter._tcp unregistered for the new fabric until a detach/reattach occurred. The two triggers are now complementary and idempotent (s_srp.added guards against double registration): - onThreadStateChanged: Thread attaches → check if fabric exists - SrpFabricDelegate::OnFabricAdded: fabric added → check if Thread attached https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
…DK versions The OnFabricAdded override didn't compile because chip::FabricTable::Delegate uses different virtual-method names in the SDK version pulled by esp_matter ^1.4.0, and the header is only available inside the build container. The delegate was added to cover "Thread stays attached while new fabric is committed" (comment 1 from review). On reflection this scenario cannot occur in practice: every compliant Matter commissioner pushes Thread credentials via AddOrUpdateThreadNetwork + ConnectNetwork when re-commissioning, which always causes OpenThread to detach briefly and re-attach. That detach triggers OT_CHANGED_THREAD_ROLE, which fires onThreadStateChanged() and schedules trySrpServiceAdd() — so the OT role-change callback already covers this path without a FabricTable::Delegate. Document the reasoning in a comment on onThreadStateChanged() so the design intent is clear. https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
…ssioning
The CHIP SDK's ESP32 DNS-SD layer publishes _matter._tcp via esp-idf mDNS, which fails (error 46) when WiFi is disabled. It never falls back to OpenThread SRP. Without an SRP registration the OTBR cannot proxy the service to mDNS, so the commissioner cannot discover the device for the CASE session and CommissioningComplete is never sent — the FailSafe timer expires and commissioning is rolled back.
Two issues kept the SRP client stuck in "Updated" state indefinitely:
No hostname — otSrpClientSendUpdate() returns kErrorInvalidState when the host name is null, so the SRP Update is never transmitted. Fixed by calling otSrpClientSetHostName() with the device's 802.15.4 extended address (same convention the CHIP SDK uses on other platforms).
No _matter._tcp service — the CHIP SDK does not call otSrpClientAddService() for the operational service on ESP32/WiFi-disabled builds. Fixed by registering an OpenThread state-change callback that calls otSrpClientAddService() with the correct instance name (-) and MRP TXT records (SII/SAI) whenever Thread attaches and a CHIP fabric is provisioned.
The kFabricRemoved event handler resets the s_srp.added flag so the service is re-registered on the next commissioning attempt.