tigera-operator permanent reconciliation storm and simultaneous calico-node Typha disconnect after upgrading RKE2 v1.26.7 to v1.27.16 (Calico v3.26.1 to v3.27.3) #4615
Description
Environmental Info:
RKE2 Version: v1.27.16+rke2r2 (b6cf44ff1085da968b2ae84049b3aca22b93d7ef)
Node(s) CPU architecture, OS, and Version: Linux x86_64, Ubuntu 24.04, kernel 6.8.0-90-generic
Cluster Configuration: 3 server nodes, 27 agent nodes (30 nodes total). CNI: Calico. Upgraded from RKE2 v1.26.7+rke2r1 (Calico v3.26.1, tigera-operator v1.30.4) to RKE2 v1.27.16+rke2r2 (Calico v3.27.3, tigera-operator v1.32.7).
Describe the bug:
After upgrading from RKE2 v1.26.7+rke2r1 to v1.27.16+rke2r2, two related issues occur:
- tigera-operator enters a permanent reconciliation storm, reconciling calico-node, calico-cni-plugin, and calico-kube-controllers at ~150 loops/minute continuously. The calico-node daemonset has been rolled to generation 2513 since the upgrade.
- All 30 calico-node pods simultaneously lose their Typha connection: every Felix instance across all 3 Typha pods disconnects at the exact same millisecond, causing a cluster-wide network outage. Felix then requires ~16 minutes to complete a full dataplane resync, during which the kubelet-to-pod network path is broken on every node and all application health checks fail.
Steps To Reproduce:
- Installed RKE2: v1.26.7+rke2r1 running with Calico v3.26.1 and tigera-operator v1.30.4
- Upgrade cluster to RKE2 v1.27.16+rke2r2 which upgrades Calico to v3.27.3 and tigera-operator to v1.32.7
- Observe tigera-operator immediately begins reconciling all calico Installation components at ~150 loops/minute
- Periodically the reconciliation successfully applies a daemonset update — all 30 calico-node pods disconnect from Typha simultaneously despite maxUnavailable: 1 being set
- All Felix instances reconnect but take ~16 minutes to fully resync dataplane
- Cluster-wide network outage during resync window
Expected behavior:
After upgrading RKE2, tigera-operator should reconcile calico-node to its desired state once and stop. If a daemonset update is required, it should respect maxUnavailable: 1 and roll pods one at a time. A calico-node pod restart should not cause cluster-wide network disruption.
Actual behavior:
tigera-operator reconciles continuously at ~150 loops/minute and never converges. When reconciliation successfully applies a daemonset update, all 30 calico-node pods lose their Typha connections simultaneously, confirmed across all 3 Typha pods, despite maxUnavailable: 1 being configured on both the Installation CR and the daemonset updateStrategy.
Typha logs showing all 30 Felix clients disconnecting within the same second across all 3 Typha pods:
2026-03-31 05:08:53.514 sync_server.go: Client closed the connection. client=10.232.7.227 type="felix"
2026-03-31 05:08:53.515 sync_server.go: Client closed the connection. client=10.232.7.222 type="felix"
2026-03-31 05:08:53.521 sync_server.go: Client closed the connection. client=10.232.7.109 type="felix"
2026-03-31 05:08:53.523 sync_server.go: Client closed the connection. client=10.232.7.100 type="felix"
[all 30 nodes within same second across all 3 Typha pods]
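The simultaneity can be verified by bucketing the disconnect lines by second. A minimal sketch, using the four sample lines above as stand-in input — in practice, feed it the output of `kubectl -n calico-system logs <typha-pod>` instead of the heredoc:

```shell
# Write the sample Typha lines (abbreviated from the report) to a scratch file.
cat <<'EOF' > typha.log
2026-03-31 05:08:53.514 sync_server.go: Client closed the connection. client=10.232.7.227 type="felix"
2026-03-31 05:08:53.515 sync_server.go: Client closed the connection. client=10.232.7.222 type="felix"
2026-03-31 05:08:53.521 sync_server.go: Client closed the connection. client=10.232.7.109 type="felix"
2026-03-31 05:08:53.523 sync_server.go: Client closed the connection. client=10.232.7.100 type="felix"
EOF
# Group "Client closed" events by date + second (dropping the millisecond part)
# and print a count per second. A single large bucket = simultaneous drop.
awk '/Client closed/ { split($2, t, "."); n[$1 " " t[1]]++ }
     END { for (s in n) print n[s], "disconnects at", s }' typha.log
```

On the sample this prints `4 disconnects at 2026-03-31 05:08:53`; on the full logs it shows all 30 clients in one bucket.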
tigera-operator reconciliation storm:
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}
[repeating ~150 times/minute continuously]
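The ~150 loops/minute figure was measured by counting reconcile lines over a fixed window. Sketch below, with a three-line sample from the report standing in for the live stream; against the cluster the input would be `kubectl -n tigera-operator logs deploy/tigera-operator --since=60s`:

```shell
# Count reconcile events in one minute of operator output.
cat <<'EOF' | grep -c 'Reconciling Installation.operator.tigera.io'
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}
EOF
```

Prints `3` for the sample; on the live logs the 60-second window consistently reports ~150.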
Felix resync completing ~16 minutes after disconnect:
2026-03-31 05:25:09 felix/health.go: Overall health status changed: live=true ready=true
daemonset updateStrategy:
{"rollingUpdate": {"maxSurge": 0, "maxUnavailable": 1}, "type": "RollingUpdate"}
tigerastatus after recovery:
NAME AVAILABLE PROGRESSING DEGRADED SINCE
calico True False False 6h47m
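For completeness, the updateStrategy above was read back from the rendered daemonset with `kubectl -n calico-system get ds calico-node -o jsonpath='{.spec.updateStrategy}'` (the Installation CR carries the same value under spec.nodeUpdateStrategy). A minimal sketch extracting the relevant field from that output:

```shell
# Confirm maxUnavailable survives on the rendered daemonset; the echoed JSON is
# the actual value captured from this cluster.
echo '{"rollingUpdate": {"maxSurge": 0, "maxUnavailable": 1}, "type": "RollingUpdate"}' \
  | grep -o '"maxUnavailable": [0-9]*'
```

Prints `"maxUnavailable": 1`, i.e. the strategy the operator then fails to honor.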
Additional context / logs:
- calico-node daemonset is at generation 2513 — rolled thousands of times since upgrade due to reconciliation storm
- Simultaneous Typha disconnect confirmed independently across all 3 Typha pods at the same millisecond — ruling out a single Typha pod crash
- maxUnavailable: 1 is set on both Installation CR spec.nodeUpdateStrategy and daemonset spec.updateStrategy — yet all pods disconnect simultaneously
- Installation CR spec and status.computed are identical — operator should detect no drift yet continues reconciling at ~150 loops/minute
- etcd healthy, all 3 members in sync, no leader elections during incident
- Running kubectl rollout restart daemonset/calico-node as a recovery step after each outage added a restartedAt annotation, which caused additional operator conflict errors; removing the annotation stopped the conflicts, but the reconciliation storm continues
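The no-drift observation in the list above was checked by comparing the Installation CR's spec against status.computed. A minimal sketch, with inline sample values standing in for the live objects (the CR name "default" is the operator's usual default; field contents here are illustrative, not the full spec):

```shell
# Live versions of these variables would come from:
#   spec=$(kubectl get installation default -o jsonpath='{.spec}')
#   computed=$(kubectl get installation default -o jsonpath='{.status.computed}')
# For robustness against key ordering, canonicalize both with `jq -S .` first.
spec='{"cni":{"type":"Calico"},"variant":"Calico"}'
computed='{"cni":{"type":"Calico"},"variant":"Calico"}'
if [ "$spec" = "$computed" ]; then
  echo "no drift: spec == status.computed"
fi
```

When the two documents are identical, the operator has nothing left to apply, which points at a controller bug rather than real configuration drift.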