tigera-operator permanent reconciliation storm and simultaneous calico-node Typha disconnect after upgrading RKE2 v1.26.7 to v1.27.16 (Calico v3.26.1 to v3.27.3) #4615
Description
Environmental Info:
RKE2 Version: v1.27.16+rke2r2 (b6cf44ff1085da968b2ae84049b3aca22b93d7ef)
Node(s) CPU architecture, OS, and Version: Linux x86_64, Ubuntu 24.04, kernel 6.8.0-90-generic
Cluster Configuration: 3 server nodes, 27 agent nodes (30 nodes total). CNI: Calico. Upgraded from RKE2 v1.26.7+rke2r1 (Calico v3.26.1, tigera-operator v1.30.4) to RKE2 v1.27.16+rke2r2 (Calico v3.27.3, tigera-operator v1.32.7).
Describe the bug:
After upgrading from RKE2 v1.26.7+rke2r1 to v1.27.16+rke2r2, two related issues occur:
- tigera-operator enters a permanent reconciliation storm, reconciling calico-node, calico-cni-plugin, and calico-kube-controllers at ~150 loops/minute continuously. The calico-node daemonset has been rolled to generation 2513 since the upgrade.
- All 30 calico-node pods simultaneously lose their Typha connection: every Felix instance across all 3 Typha pods disconnects at the exact same millisecond, causing a cluster-wide network outage. Felix then requires ~16 minutes to complete a full dataplane resync, during which the kubelet-to-pod network path is broken on every node and all application health checks fail.
Steps To Reproduce:
- Installed RKE2: v1.26.7+rke2r1 running with Calico v3.26.1 and tigera-operator v1.30.4
- Upgrade cluster to RKE2 v1.27.16+rke2r2 which upgrades Calico to v3.27.3 and tigera-operator to v1.32.7
- Observe tigera-operator immediately begins reconciling all calico Installation components at ~150 loops/minute
- Periodically the reconciliation successfully applies a daemonset update — all 30 calico-node pods disconnect from Typha simultaneously despite maxUnavailable: 1 being set
- All Felix instances reconnect but take ~16 minutes to fully resync dataplane
- Cluster-wide network outage during resync window
Expected behavior:
After upgrading RKE2, tigera-operator should reconcile calico-node to its desired state once and stop. If a daemonset update is required, it should respect maxUnavailable: 1 and roll pods one at a time. A calico-node pod restart should not cause cluster-wide network disruption.
Actual behavior:
tigera-operator reconciles continuously at ~150 loops/minute and never converges. When reconciliation successfully applies a daemonset update, all 30 calico-node pods lose their Typha connections simultaneously, confirmed across all 3 Typha pods, despite maxUnavailable: 1 being configured on both the Installation CR and the daemonset updateStrategy.
Typha logs showing all 30 Felix clients disconnecting within the same second across all 3 Typha pods:
2026-03-31 05:08:53.514 sync_server.go: Client closed the connection. client=10.232.7.227 type="felix"
2026-03-31 05:08:53.515 sync_server.go: Client closed the connection. client=10.232.7.222 type="felix"
2026-03-31 05:08:53.521 sync_server.go: Client closed the connection. client=10.232.7.109 type="felix"
2026-03-31 05:08:53.523 sync_server.go: Client closed the connection. client=10.232.7.100 type="felix"
[all 30 nodes within same second across all 3 Typha pods]
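The simultaneity can be verified by bucketing the disconnect lines by second. A minimal sketch, using the four sample lines above as stand-in input — in practice, feed it the output of `kubectl -n calico-system logs <typha-pod>` instead of the heredoc:

```shell
# Write the sample Typha lines (abbreviated from the report) to a scratch file.
cat <<'EOF' > typha.log
2026-03-31 05:08:53.514 sync_server.go: Client closed the connection. client=10.232.7.227 type="felix"
2026-03-31 05:08:53.515 sync_server.go: Client closed the connection. client=10.232.7.222 type="felix"
2026-03-31 05:08:53.521 sync_server.go: Client closed the connection. client=10.232.7.109 type="felix"
2026-03-31 05:08:53.523 sync_server.go: Client closed the connection. client=10.232.7.100 type="felix"
EOF
# Group "Client closed" events by date + second (dropping the millisecond part)
# and print a count per second. A single large bucket = simultaneous drop.
awk '/Client closed/ { split($2, t, "."); n[$1 " " t[1]]++ }
     END { for (s in n) print n[s], "disconnects at", s }' typha.log
```

On the sample this prints `4 disconnects at 2026-03-31 05:08:53`; on the full logs it shows all 30 clients in one bucket.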
tigera-operator reconciliation storm:
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}
[repeating ~150 times/minute continuously]
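The ~150 loops/minute figure was measured by counting reconcile lines over a fixed window. Sketch below, with a three-line sample from the report standing in for the live stream; against the cluster the input would be `kubectl -n tigera-operator logs deploy/tigera-operator --since=60s`:

```shell
# Count reconcile events in one minute of operator output.
cat <<'EOF' | grep -c 'Reconciling Installation.operator.tigera.io'
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}
EOF
```

Prints `3` for the sample; on the live logs the 60-second window consistently reports ~150.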
Felix resync completing ~16 minutes after disconnect:
2026-03-31 05:25:09 felix/health.go: Overall health status changed: live=true ready=true
daemonset updateStrategy:
{"rollingUpdate": {"maxSurge": 0, "maxUnavailable": 1}, "type": "RollingUpdate"}
tigerastatus after recovery:
NAME AVAILABLE PROGRESSING DEGRADED SINCE
calico True False False 6h47m
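For completeness, the updateStrategy above was read back from the rendered daemonset with `kubectl -n calico-system get ds calico-node -o jsonpath='{.spec.updateStrategy}'` (the Installation CR carries the same value under spec.nodeUpdateStrategy). A minimal sketch extracting the relevant field from that output:

```shell
# Confirm maxUnavailable survives on the rendered daemonset; the echoed JSON is
# the actual value captured from this cluster.
echo '{"rollingUpdate": {"maxSurge": 0, "maxUnavailable": 1}, "type": "RollingUpdate"}' \
  | grep -o '"maxUnavailable": [0-9]*'
```

Prints `"maxUnavailable": 1`, i.e. the strategy the operator then fails to honor.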
Additional context / logs:
- calico-node daemonset is at generation 2513 — rolled thousands of times since upgrade due to reconciliation storm
- Simultaneous Typha disconnect confirmed independently across all 3 Typha pods at the same millisecond — ruling out a single Typha pod crash
- maxUnavailable: 1 is set on both Installation CR spec.nodeUpdateStrategy and daemonset spec.updateStrategy — yet all pods disconnect simultaneously
- Installation CR spec and status.computed are identical — operator should detect no drift yet continues reconciling at ~150 loops/minute
- etcd healthy, all 3 members in sync, no leader elections during incident
- Running kubectl rollout restart daemonset/calico-node as a recovery step after each outage added a restartedAt annotation, which caused additional operator conflict errors; removing the annotation stopped the conflicts, but the reconciliation storm continues
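The no-drift observation in the list above was checked by comparing the Installation CR's spec against status.computed. A minimal sketch, with inline sample values standing in for the live objects (the CR name "default" is the operator's usual default; field contents here are illustrative, not the full spec):

```shell
# Live versions of these variables would come from:
#   spec=$(kubectl get installation default -o jsonpath='{.spec}')
#   computed=$(kubectl get installation default -o jsonpath='{.status.computed}')
# For robustness against key ordering, canonicalize both with `jq -S .` first.
spec='{"cni":{"type":"Calico"},"variant":"Calico"}'
computed='{"cni":{"type":"Calico"},"variant":"Calico"}'
if [ "$spec" = "$computed" ]; then
  echo "no drift: spec == status.computed"
fi
```

When the two documents are identical, the operator has nothing left to apply, which points at a controller bug rather than real configuration drift.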