
tigera-operator permanent reconciliation storm and simultaneous calico-node Typha disconnect after upgrading RKE2 v1.26.7 to v1.27.16 (Calico v3.26.1 to v3.27.3) #4615

@Sangam-ghimire

Description


Environmental Info:

RKE2 Version: v1.27.16+rke2r2 (b6cf44ff1085da968b2ae84049b3aca22b93d7ef)

Node(s) CPU architecture, OS, and Version: Linux x86_64, Ubuntu 24.04, kernel 6.8.0-90-generic

Cluster Configuration: 3 server nodes, 27 agent nodes (30 nodes total). CNI: Calico. Upgraded from RKE2 v1.26.7+rke2r1 (Calico v3.26.1, tigera-operator v1.30.4) to RKE2 v1.27.16+rke2r2 (Calico v3.27.3, tigera-operator v1.32.7).

Describe the bug:

After upgrading from RKE2 v1.26.7+rke2r1 to v1.27.16+rke2r2, two related issues occur:

  1. tigera-operator enters a permanent reconciliation storm, reconciling calico-node, calico-cni-plugin, and calico-kube-controllers at ~150 loops/minute without ever converging.
    The calico-node daemonset has reached generation 2513 since the upgrade.

  2. All 30 calico-node pods simultaneously lose their Typha connection: every Felix instance across all 3 Typha pods disconnects at the same millisecond, causing a cluster-wide network outage.
    Felix then requires approximately 16 minutes to complete a full dataplane resync, during which the kubelet-to-pod network path is broken on every node simultaneously and all application health checks fail.

Steps To Reproduce:

  • Installed RKE2: v1.26.7+rke2r1 running with Calico v3.26.1 and tigera-operator v1.30.4
  • Upgrade cluster to RKE2 v1.27.16+rke2r2 which upgrades Calico to v3.27.3 and tigera-operator to v1.32.7
  • Observe tigera-operator immediately begins reconciling all calico Installation components at ~150 loops/minute
  • Periodically a reconciliation loop successfully applies a daemonset update; all 30 calico-node pods then disconnect from Typha simultaneously, despite maxUnavailable: 1 being set
  • All Felix instances reconnect but take ~16 minutes to fully resync dataplane
  • Cluster-wide network outage during resync window

Expected behavior:

After upgrading RKE2, tigera-operator should reconcile calico-node to its desired state once and stop. If a daemonset update is required, it should respect maxUnavailable: 1 and roll pods one at a time. A calico-node pod restart should not cause cluster-wide network disruption.

Actual behavior:

tigera-operator reconciles continuously at ~150 loops/minute and never converges. When reconciliation successfully applies a daemonset update, all 30 calico-node pods lose their Typha connections simultaneously, confirmed across all 3 Typha pods, despite maxUnavailable: 1 being configured on both the Installation CR and the daemonset updateStrategy.

Typha logs showing all 30 Felix clients disconnecting within the same second across all 3 Typha pods:

2026-03-31 05:08:53.514 sync_server.go: Client closed the connection. client=10.232.7.227 type="felix"
2026-03-31 05:08:53.515 sync_server.go: Client closed the connection. client=10.232.7.222 type="felix"
2026-03-31 05:08:53.521 sync_server.go: Client closed the connection. client=10.232.7.109 type="felix"
2026-03-31 05:08:53.523 sync_server.go: Client closed the connection. client=10.232.7.100 type="felix"
[all 30 nodes within same second across all 3 Typha pods]
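To rule out coincidental timestamps, the disconnect lines can be grouped by wall-clock second with a short script. This is a sketch over the log format shown above; the inlined sample lines stand in for a full log capture:

```python
from collections import defaultdict

# Sample Typha log lines from the incident (format as observed above);
# a real run would read the full log from all 3 Typha pods.
lines = """\
2026-03-31 05:08:53.514 sync_server.go: Client closed the connection. client=10.232.7.227 type="felix"
2026-03-31 05:08:53.515 sync_server.go: Client closed the connection. client=10.232.7.222 type="felix"
2026-03-31 05:08:53.521 sync_server.go: Client closed the connection. client=10.232.7.109 type="felix"
2026-03-31 05:08:53.523 sync_server.go: Client closed the connection. client=10.232.7.100 type="felix"
""".splitlines()

# Group disconnect events by second (timestamp up to the first ".").
by_second = defaultdict(list)
for line in lines:
    if "Client closed the connection" not in line:
        continue
    second = line.split(".", 1)[0]                  # "2026-03-31 05:08:53"
    client = line.split("client=")[1].split()[0]    # "10.232.7.227"
    by_second[second].append(client)

for second, clients in sorted(by_second.items()):
    print(f"{second}: {len(clients)} felix disconnects")
```

On the full capture, all 30 clients land in a single one-second bucket.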

tigera-operator reconciliation storm:

{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}
[repeating ~150 times/minute continuously]
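The ~150 loops/minute figure can be reproduced by bucketing the operator's JSON log lines per minute. Sketch only: the `ts` field name and the sample entries below are assumptions, since the lines quoted above omit timestamps; adjust to the actual log format:

```python
import json
from collections import Counter

# Hypothetical operator log lines; a real capture would come from
# `kubectl logs -n tigera-operator deploy/tigera-operator`. The "ts"
# field name is an assumption about the JSON log format.
log_lines = [
    '{"ts":"2026-03-31T05:08:01Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}',
    '{"ts":"2026-03-31T05:08:01Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}',
    '{"ts":"2026-03-31T05:08:02Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}',
]

per_minute = Counter()
for raw in log_lines:
    entry = json.loads(raw)
    if entry.get("msg", "").startswith("Reconciling"):
        per_minute[entry["ts"][:16]] += 1   # bucket by "YYYY-MM-DDTHH:MM"

for minute, count in sorted(per_minute.items()):
    print(f"{minute}: {count} reconciles")
```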

Felix resync completing ~16 minutes after disconnect:

2026-03-31 05:25:09 felix/health.go: Overall health status changed: live=true ready=true

daemonset updateStrategy:

{"rollingUpdate": {"maxSurge": 0, "maxUnavailable": 1}, "type": "RollingUpdate"}
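For reference, under this strategy the DaemonSet controller should never take more than one pod down at a time. A toy model of the expected rollout (illustrative only, not Calico's or Kubernetes' actual controller logic) makes the contrast with the observed 30-pod simultaneous disconnect explicit:

```python
# Toy model of a RollingUpdate DaemonSet rollout with maxUnavailable=1:
# pods are replaced strictly one at a time, so at most 1 of the 30
# calico-node pods should ever be disconnected from Typha at once.
NODES = 30
MAX_UNAVAILABLE = 1

max_down_observed = 0
down = set()
for node in range(NODES):
    down.add(node)                        # old pod terminates
    max_down_observed = max(max_down_observed, len(down))
    assert len(down) <= MAX_UNAVAILABLE   # invariant the controller enforces
    down.remove(node)                     # new pod becomes Ready before the next

print(f"peak simultaneous unavailable pods: {max_down_observed}")
```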

tigerastatus after recovery:

NAME     AVAILABLE   PROGRESSING   DEGRADED   SINCE
calico   True        False         False      6h47m

Additional context / logs:

  • calico-node daemonset is at generation 2513 — rolled thousands of times since upgrade due to reconciliation storm
  • Simultaneous Typha disconnect confirmed independently across all 3 Typha pods at the same millisecond — ruling out a single Typha pod crash
  • maxUnavailable: 1 is set on both Installation CR spec.nodeUpdateStrategy and daemonset spec.updateStrategy — yet all pods disconnect simultaneously
  • Installation CR spec and status.computed are identical — operator should detect no drift yet continues reconciling at ~150 loops/minute
  • etcd healthy, all 3 members in sync, no leader elections during incident
  • Running kubectl rollout restart daemonset/calico-node as a recovery step after each outage added a restartedAt annotation, causing additional operator conflict errors; removing the annotation stopped the conflicts, but the reconciliation storm continues
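The "spec and status.computed are identical" claim above was verified by deep-comparing the two objects. A sketch of the check, with illustrative field values standing in for the real CR (actual data would come from `kubectl get installation default -o json`):

```python
import json

# Illustrative excerpt of the Installation CR; the real object would be
# loaded from `kubectl get installation default -o json`.
cr = {
    "spec": {
        "cni": {"type": "Calico"},
        "nodeUpdateStrategy": {"rollingUpdate": {"maxUnavailable": 1}},
    },
    "status": {
        "computed": {
            "cni": {"type": "Calico"},
            "nodeUpdateStrategy": {"rollingUpdate": {"maxUnavailable": 1}},
        }
    },
}

spec = cr["spec"]
computed = cr["status"]["computed"]

# Key-order-insensitive deep equality via canonical JSON serialization.
drift = json.dumps(spec, sort_keys=True) != json.dumps(computed, sort_keys=True)
print("drift detected" if drift else "spec and status.computed are identical")
```

If this prints "identical" while the operator still reconciles at ~150 loops/minute, the storm is not being driven by visible spec drift.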
