feat(monitor): operator Prometheus metrics with mTLS#4558
feat(monitor): operator Prometheus metrics with mTLS#4558rene-dekker wants to merge 7 commits intotigera:masterfrom
Conversation
pkg/render/monitor/monitor.go
Outdated
| rules = append(rules, | ||
| monitoringv1.Rule{ | ||
| Alert: "TLSCertExpiringWarning", | ||
| Expr: intstr.FromString("tigera_operator_tls_certificate_expiry_timestamp_seconds - time() < 29 * 24 * 3600"), |
There was a problem hiding this comment.
I picked 29 days, because we automatically renew our own certs 30d before expiry. I wanted to exclude that day to generate fewer alerts.
cmd/main.go
Outdated
| MinVersion: tls.VersionTLS12, | ||
| }, nil | ||
| } | ||
| cfg.MinVersion = tls.VersionTLS12 |
There was a problem hiding this comment.
Do we need a configurable min TLS version (and configurable client cert checking?)
I think we have that in the other metrics endpoints (at least for OSS Calico)
There was a problem hiding this comment.
Good idea. I will use the same code as OSS for both the TLS version and for the cert verification checks.
cmd/main.go
Outdated
|
|
||
| // metricsEnabled returns true when the operator metrics endpoint is enabled. | ||
| func metricsEnabled() bool { | ||
| return strings.EqualFold(os.Getenv("METRICS_ENABLED"), "true") |
There was a problem hiding this comment.
Technically a breaking change, since someone might be using metrics today without this set?
But probably not a big deal - easy fix.
There was a problem hiding this comment.
Yes, my rationale was that it is better to be explicit in the enablement, it can be reverted if desired. I would not think that there are that many users of prometheus as there were no metrics other than the out-of-the-box ones before this PR.
cmd/main.go
Outdated
| } | ||
|
|
||
| // metricsTLSEnabled returns true when the operator metrics endpoint should use mTLS. | ||
| func metricsTLSEnabled() bool { |
There was a problem hiding this comment.
might be worth a metrics.go in the main package to put these helpers in to keep main.go a bit less of a "God file"
cmd/main.go
Outdated
| // dynamicCertLoader dynamically loads TLS certificates from Kubernetes secrets | ||
| // for the metrics endpoint. The monitor controller creates the server cert, and | ||
| // the client CA is loaded from the Prometheus client TLS secret. | ||
| type dynamicCertLoader struct { |
There was a problem hiding this comment.
I wonder, do we need all of this complexity?
Couldn't we just use an optional secret file mount, and have kubelet auto-load changes to the mounted paths?
There was a problem hiding this comment.
I agree it is a better approach, even though it will cause a restart of the container. Based on my reading the kubelet may take up to a minute to trigger the restart, so for the first little bit after deploying the operator you would not have metrics, but I think for overall simplicity it is better, will do. Thanks.
pkg/controller/metrics/collectors.go
Outdated
|
|
||
| func conditionLabel(ct operatorv1.StatusConditionType) string { | ||
| switch ct { | ||
| case operatorv1.ComponentAvailable: |
There was a problem hiding this comment.
Isn't this just effectively strings.ToLower(ct)?
pkg/controller/metrics/collectors.go
Outdated
| continue | ||
| } | ||
|
|
||
| issuer := s.Annotations["certificates.operator.tigera.io/issuer"] |
There was a problem hiding this comment.
Do we have a const we can use? Should we have one?
| } | ||
|
|
||
| // metricsEnabled returns true when the operator metrics endpoint is enabled. | ||
| func metricsEnabled() bool { |
There was a problem hiding this comment.
Looks like a duplicate of the one in the main package - would be nice to have a common impl.
pkg/render/monitor/monitor.go
Outdated
| bearerTokenFile = "/var/run/secrets/kubernetes.io/serviceaccount/token" | ||
| KubeControllerMetrics = "calico-kube-controllers-metrics" | ||
|
|
||
| OperatorMetricsSecretName = "tigera-operator-tls" |
There was a problem hiding this comment.
It's a bit confusing that this is "tigera-operator-tls" but the var is "metrics" focused - I think it is correct for the secret name to NOT include the word metrics, but perhaps the variable name should match.
Would be worth some comments explaining as well. If we do end up using the cert in the future for anything else, we might need to move its creation out of this controller?
pkg/render/monitor/monitor.go
Outdated
| ObjectMeta: metav1.ObjectMeta{ | ||
| Name: OperatorMetricsServiceName, | ||
| Namespace: common.TigeraPrometheusNamespace, | ||
| Labels: map[string]string{"team": "network-operators"}, |
There was a problem hiding this comment.
I think this stems all the way back from when everything was manifest based installed. It may even have come from the prometheus example bundle. I think we should delete them all, everywhere probably.
cmd/main.go
Outdated
| MinVersion: tls.VersionTLS12, | ||
| }, nil | ||
| } | ||
| cfg.MinVersion = tls.VersionTLS12 |
There was a problem hiding this comment.
Good idea. I will use the same code as OSS for both the TLS version and for the cert verification checks.
cmd/main.go
Outdated
| // dynamicCertLoader dynamically loads TLS certificates from Kubernetes secrets | ||
| // for the metrics endpoint. The monitor controller creates the server cert, and | ||
| // the client CA is loaded from the Prometheus client TLS secret. | ||
| type dynamicCertLoader struct { |
There was a problem hiding this comment.
I agree it is a better approach, even though it will cause a restart of the container. Based on my reading the kubelet may take up to a minute to trigger the restart, so for the first little bit after deploying the operator you would not have metrics, but I think for overall simplicity it is better, will do. Thanks.
pkg/common/operator_namespace.go
Outdated
| return name | ||
| } | ||
|
|
||
| // MetricsEnabled returns true when the operator metrics endpoint is enabled via METRICS_ENABLED=true. |
There was a problem hiding this comment.
Bit of a nit, but shouldn't these be in a metrics locale instead of "operator_namespace.go"?
cmd/metrics.go
Outdated
|
|
||
| // ParseTLSVersion parses TLS version string and returns the corresponding tls version constant. | ||
| // Accepts: "1.2", "1.3", or empty string (defaults to "1.2"). | ||
| func ParseTLSVersion(version string) (uint16, error) { |
There was a problem hiding this comment.
Might be worth a tls package for the utils?
pkg/controller/metrics/collectors.go
Outdated
| func (c *OperatorCollector) collectLicense(ctx context.Context, ch chan<- prometheus.Metric) { | ||
| license, err := utils.FetchLicenseKey(ctx, c.client) | ||
| if err != nil { | ||
| // License not available (e.g., Calico OSS). Skip gracefully. |
There was a problem hiding this comment.
I think this is no longer correct, this comment? Because of the aforementioned enterprise check?
Add operator metrics endpoint with configurable mTLS via METRICS_SCHEME, METRICS_HOST, and METRICS_PORT env vars. The monitor controller creates a server cert, Service, and ServiceMonitor for Prometheus integration. Client auth trusts the tigera-ca-private CA rather than individual leaf certs. Includes a custom Prometheus collector for operator status gauges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the implicit metrics-enabled detection (METRICS_HOST/PORT set) with an explicit METRICS_ENABLED=true env var. Default METRICS_HOST to 0.0.0.0 and METRICS_PORT to 8484 when enabled. Log a helpful message when mTLS is enabled but the server certificate is not yet available. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use port 9484 instead of 8484 to reduce the chance of conflicts on host-networked nodes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set SecureServing: true so controller-runtime actually serves TLS instead of plain HTTP when METRICS_SCHEME=https. Add egress rule in calicoSystemPrometheusPolicy to allow Prometheus to reach the operator metrics Service. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gate the tigera-operator-tls keypair creation on METRICS_SCHEME=https rather than METRICS_ENABLED=true. The Service and ServiceMonitor are still created whenever metrics are enabled, but the TLS certificate is only needed when mTLS is active. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add METRICS_CLIENT_AUTH and TLS_MIN_VERSION env vars for configurable mTLS client auth and TLS version (aligned with calico-private) - Extract metrics helpers into cmd/metrics.go - Replace dynamic cert loader with file-based TLS loading via volume mounts - Move MetricsEnabled/MetricsTLSEnabled to pkg/common for shared use - Export certificate annotation constants from certificatemanagement package - Rename OperatorMetricsSecretName to OperatorTLSSecretName - Fix inconsistent method naming (serviceOperatorMetrics) - Simplify conditionLabel to strings.ToLower - Add debug log for unparseable cert expiry annotations - Skip license API calls on OSS clusters (check CRD at startup) - Remove legacy "team: network-operators" label from all ServiceMonitors - Build ./cmd/ package instead of single file Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move MetricsEnabled/MetricsTLSEnabled to pkg/common/metrics.go - Move ParseTLSVersion and ParseClientAuthType to pkg/tls/tls.go - Fix stale comment on license error path Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
METRICS_HOST,METRICS_PORT, andMETRICS_SCHEMEenv varsMETRICS_SCHEME=https: server cert fromtigera-operator-tls, client auth truststigera-ca-privateCAoperator_installation_statusandoperator_tigera_statusgaugesTest plan
make buildpassesmake ut UT_DIR=./pkg/controller/metrics— 10 tests passmake ut UT_DIR=./pkg/render/monitor— 20 tests passmake ut UT_DIR=./pkg/controller/monitor— 17 tests passMETRICS_HOST/METRICS_PORTset, verify metrics scrapedMETRICS_SCHEME=https, verify mTLS handshake with Prometheus🤖 Generated with Claude Code
Example alerts:

Example metrics: