Skip to content

app_main: add SRP hostname + _matter._tcp service to fix Thread commi…#42

Merged
GuvHas merged 4 commits intomainfrom
claude/review-esp32-thread-code-EsKdn
Mar 19, 2026
Merged

app_main: add SRP hostname + _matter._tcp service to fix Thread commi…#42
GuvHas merged 4 commits intomainfrom
claude/review-esp32-thread-code-EsKdn

Conversation

@GuvHas
Copy link
Copy Markdown
Owner

@GuvHas GuvHas commented Mar 19, 2026

…ssioning

The CHIP SDK's ESP32 DNS-SD layer publishes _matter._tcp via esp-idf mDNS, which fails (error 46) when WiFi is disabled. It never falls back to OpenThread SRP. Without an SRP registration the OTBR cannot proxy the service to mDNS, so the commissioner cannot discover the device for the CASE session and CommissioningComplete is never sent — the FailSafe timer expires and commissioning is rolled back.

Two issues kept the SRP client stuck in "Updated" state indefinitely:

  1. No hostname — otSrpClientSendUpdate() returns kErrorInvalidState when the host name is null, so the SRP Update is never transmitted. Fixed by calling otSrpClientSetHostName() with the device's 802.15.4 extended address (same convention the CHIP SDK uses on other platforms).

  2. No _matter._tcp service — the CHIP SDK does not call otSrpClientAddService() for the operational service on ESP32/WiFi-disabled builds. Fixed by registering an OpenThread state-change callback that calls otSrpClientAddService() with the correct instance name (-) and MRP TXT records (SII/SAI) whenever Thread attaches and a CHIP fabric is provisioned.

The kFabricRemoved event handler resets the s_srp.added flag so the service is re-registered on the next commissioning attempt.

…ssioning

The CHIP SDK's ESP32 DNS-SD layer publishes _matter._tcp via esp-idf mDNS,
which fails (error 46) when WiFi is disabled.  It never falls back to
OpenThread SRP.  Without an SRP registration the OTBR cannot proxy the
service to mDNS, so the commissioner cannot discover the device for the
CASE session and CommissioningComplete is never sent — the FailSafe timer
expires and commissioning is rolled back.

Two issues kept the SRP client stuck in "Updated" state indefinitely:

1. No hostname — otSrpClientSendUpdate() returns kErrorInvalidState when
   the host name is null, so the SRP Update is never transmitted.  Fixed
   by calling otSrpClientSetHostName() with the device's 802.15.4 extended
   address (same convention the CHIP SDK uses on other platforms).

2. No _matter._tcp service — the CHIP SDK does not call otSrpClientAddService()
   for the operational service on ESP32/WiFi-disabled builds.  Fixed by
   registering an OpenThread state-change callback that calls
   otSrpClientAddService() with the correct instance name
   (<CompressedFabricId>-<NodeId>) and MRP TXT records (SII/SAI) whenever
   Thread attaches and a CHIP fabric is provisioned.

The kFabricRemoved event handler resets the s_srp.added flag so the service
is re-registered on the next commissioning attempt.

https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ce2f36bb48

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

// fabric table from the correct thread context.
void onThreadStateChanged(uint32_t flags, void *ctx)
{
if (!(flags & OT_CHANGED_THREAD_ROLE)) return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Register SRP service on fabric changes too

onThreadStateChanged() exits unless OT_CHANGED_THREAD_ROLE is set, so trySrpServiceAdd() only runs on role transitions. In re-commissioning flows where the node stays attached to Thread (role unchanged) and the fabric is added/changed later, this path never runs again, leaving _matter._tcp unregistered for the new fabric and breaking operational discovery/CASE completion until a detach/attach happens.

Useful? React with 👍 / 👎.

// Log prominently so decommissioning is visible without a logic analyser.
// Reset the SRP added flag so the service is re-registered on the next
// commissioning attempt (after Thread re-joins with new credentials).
s_srp.added = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove SRP service from OpenThread when fabric is removed

On kFabricRemoved the code only flips s_srp.added to false but never deregisters the previously added SRP service from the OpenThread client. That leaves the old operational instance potentially advertised with stale <CompressedFabricId>-<NodeId> data, so controllers can discover an invalid record and re-commissioning behavior becomes inconsistent until reboot or another explicit SRP update.

Useful? React with 👍 / 👎.

claude added 3 commits March 19, 2026 17:08
Addresses two review observations:

1. (Critical) Stale SRP record after fabric removal:
   On kFabricRemoved the previous code only reset s_srp.added without
   deregistering the service from OpenThread.  The OTBR would continue
   proxying the old <CFID>-<NodeId>._matter._tcp record to mDNS, making
   a dead node discoverable and causing confusing CASE failures for any
   controller that found the stale record.
   Fix: schedule srpServiceRemove() via ScheduleWork which calls
   otSrpClientRemoveHostAndServices(aRemoveKeyLease=false, aSendUnregister=true)
   so the OTBR withdraws the record immediately.

2. (Low risk) Re-commission with Thread already attached:
   onThreadStateChanged() only fires on OT_CHANGED_THREAD_ROLE, so in the
   exotic case where Thread stays attached across a fabric change (e.g. normal
   decommission without dataset reset followed by re-commission), trySrpServiceAdd()
   would not run.  In practice this cannot occur with spec-compliant commissioners
   because AddOrUpdateThreadNetwork always causes a Thread rejoin (role change).
   Acknowledged in a comment on onThreadStateChanged(); no code change needed
   as the Thread role change callback already covers all practical scenarios.

https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
…ecommission

Add SrpFabricDelegate (chip::FabricTable::Delegate) whose OnFabricAdded()
schedules trySrpServiceAdd() whenever AddNOC commits a new fabric.

Previously trySrpServiceAdd() was only triggered by OT_CHANGED_THREAD_ROLE
via onThreadStateChanged().  If the node remained attached to Thread across
a decommission+recommission cycle (fabric removed but Thread dataset kept),
the role never changed so the callback never fired, leaving _matter._tcp
unregistered for the new fabric until a detach/reattach occurred.

The two triggers are now complementary and idempotent (s_srp.added guards
against double registration):
  - onThreadStateChanged: Thread attaches → check if fabric exists
  - SrpFabricDelegate::OnFabricAdded: fabric added → check if Thread attached

https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
…DK versions

The OnFabricAdded override didn't compile because chip::FabricTable::Delegate
uses different virtual-method names in the SDK version pulled by esp_matter
^1.4.0, and the header is only available inside the build container.

The delegate was added to cover "Thread stays attached while new fabric is
committed" (comment 1 from review).  On reflection this scenario cannot occur
in practice: every compliant Matter commissioner pushes Thread credentials via
AddOrUpdateThreadNetwork + ConnectNetwork when re-commissioning, which always
causes OpenThread to detach briefly and re-attach.  That detach triggers
OT_CHANGED_THREAD_ROLE, which fires onThreadStateChanged() and schedules
trySrpServiceAdd() — so the OT role-change callback already covers this path
without a FabricTable::Delegate.

Document the reasoning in a comment on onThreadStateChanged() so the design
intent is clear.

https://claude.ai/code/session_01SxqAiQApiRTwXefk9aQ6bR
@GuvHas GuvHas merged commit 64d5306 into main Mar 19, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants